

Inaugural Address


BY


GIOVANNI POLVANI


President of the Italian Physical Society


Today, a fortnight later, we repeat this brief inaugural ceremony for the
II Course of 1958, the VII of the series begun in 1953. Following the course on
plasma physics, which closed last Saturday, this one is being held this year by
the International School of Physics of our Italian Physical Society on the
Theory of Information: a discipline of great topicality and novelty, and
certainly of great potential for application to the most varied fields of both
pure knowledge and technology, understanding by the latter term also the means
by which knowledge itself is exchanged, criticized, and produced.

And, as on previous occasions, it gives me great pleasure to extend, also on
behalf of the Society, the most cordial greeting to all those present, especially
to H. E. the Hon. SCAGLIA, Undersecretary for Public Education, who is here
representing the Minister, Prof. MORO; and to extend to all participants in the
Course a welcome to our School and to Varenna.

It would not be entirely correct to divide these participants, as the usual
organization of the Courses of our School would have it, into the categories
of lecturers, students, and observers.

The reason is that this new discipline, the Theory of Information (if it can
indeed be called an autonomous discipline), is still taking shape. It is
therefore appropriate that the « lessons » proper should at times give way,
where necessary, to discussions and to the presentation of results only just
obtained, or of directions glimpsed in original research still in progress. In
other words, the « school » must turn into a « conference » (a « symposium », as
one would say today) or even into a plain « scientific conversation ». The
participants could then all be regarded under the single heading of
« scholars »; and, if we wish to keep the term « lecturer », we may call
« lecturers » those who will take on the labour of giving lectures, conferences,
and seminars, and of bringing the critical contribution of their own knowledge
and expertise into the thick of the discussion, in moments of doubt, hesitation,
and so on; and we may call the others « pupils » or, better, « students », only
because they wish above all to study, learn, and penetrate this new discipline,
the Theory of Information.

This division being accepted, allow me, as at previous inaugurations, to
introduce to one another these participants in the Course, these « guests »
(for that is the fitting term), these most welcome guests; and allow me also to
make one exception to the general rule of naming them in alphabetical order, an
exception surely approved by all and welcome to all, by naming first the
lecturer who would come last... alphabetically.


Lecturers: NORBERT WIENER of Cambridge (U.S.A.); Y. BAR-HILLEL of
Jerusalem, V. BRAITENBERG of Naples, R. BUSA of Gallarate, E. R. CAIANIELLO
of Naples, W. DAVENPORT of Lexington (U.S.A.), R. M. FANO of Cambridge
(U.S.A.), A. FEINSTEIN of Stanford (U.S.A.), D. GABOR of London, P. E. GREEN
of Lexington, M. HALLE of Cambridge (U.S.A.), B. HASSENSTEIN of Tübingen,
H. HAUS of Cambridge (U.S.A.), D. A. HUFFMAN of Cambridge (U.S.A.),
Y. W. LEE of Cambridge (U.S.A.), L. LÖFGREN of Stockholm, B. McMILLAN
of Murray Hill (U.S.A.), B. MANDELBROT of Paris, G. MORUZZI of Pisa,
E. NEWMAN of Cambridge (U.S.A.), W. REICHARDT of Tübingen, R. RIGHI of
Rome, N. ROCHESTER of Yorktown Heights (U.S.A.), W. ROSENBLITH of
Cambridge (U.S.A.), J. SCHOUTEN of Eindhoven, M. SCHÜTZENBERGER of Paris,
D. SLEPIAN of Murray Hill (U.S.A.), F. STUMPERS of Eindhoven, S. WATANABE
of Ossining (U.S.A.).

To those I have named go my warmest thanks for their collaboration,
especially to my friend Prof. EDUARDO R. CAIANIELLO, who, having already taken
on the considerable labour of the scientific organization of the Course, now
takes on that of directing it: to him I express, also on behalf of the whole
Society, particularly warm and affectionate thanks for all that he has done and
will do to bring the Course about.


Students: C. Boum of Rome, F. BRESSON of Paris, J. C. BRIANNE of Lille,
M. CECCARELLI of Bologna, G. COLOMBO of Milan, A. CUZZER of Rome,
A. FIORENTINI of Arcetri, J. Hm of Redding, F. DE JAEGER of Eindhoven,
F. LAURIA of Naples, A. LEPSCHY of Rome, L. LUNELLI of Milan,
L. MONTANET of Geneva, A. L. NAGEL of Eindhoven, H. OnZu of Vienna,
N. ONESTO of Naples, F. PANDARESE of Naples, F. PIERANTONI of Bologna,
G. QUAZZA of Milan, A. RUBERTI of Rome, C. SCHAERF of Rome,
W. F. SCHALKWIJK of Eindhoven, P. SCHNUPP of Göttingen, C. VAN SCOONEVELD
of The Hague, V. SOMENZI of Rome, A. J. STAM of The Hague, L. STIBE of
Cambridge (U.S.A.), D. VARIU of Tübingen, V. VITTORELLI of Palermo,
J. Wirt of Munich, R. WOODCOCK of Hampton.

To all of you, my warmest wish that you may draw the greatest profit from
this Course-Conference.


Finally, I do not wish to miss the opportunity offered by this inaugural
gathering to renew the expression of my deepest gratitude to all those whose
financial help has enabled the School to make this year the truly considerable
effort of organizing and holding four courses. I already named our sponsors in
my brief speech inaugurating the I Course of 1958, which closed the day before
yesterday, Saturday, and it is perhaps superfluous for me to recall them by name
one by one. But to the Massachusetts Institute of Technology of Cambridge,
United States, which has borne the travel expenses of the many lecturers at this
Course who come from America, and to the University of Naples, which has given
effective help in meeting the costs of the organizational preparation of the
Course, I wish to offer, also on behalf of the Society, particular thanks.


And now I close this speech. Yesterday evening, talking with some friends,
I heard them express their pleasant surprise at the variety of nations
represented in each of these international courses: eighteen in the last one,
eleven in this. It is one of the merits of science in general, and of Physics in
particular, that it can throw solid bridges over the many separations which,
under the name of « borders », divide men from men. And might we not then
reasonably hope (or is it instead sheer folly to hope) that this may tomorrow be
the way by which all men will at last find a common basis for secure and cordial
coexistence? If this can one day be reached, as is to be hoped, and by this
road, then our International School of Physics, and Varenna too, will have their
small share of merit.


With these sentiments I declare open the II Course of Physics 1958, the VII
since the founding of the International School of Physics of our Society, on
the Theory of Information.

SUPPLEMENTO AL VOLUME XIII, SERIE X, DEL NUOVO CIMENTO
N. 2, 1959, 3° Trimestre


Opening Lecture
BY


EDUARDO R. CAIANIELLO


Director of the Course


The seventh Course of the International Summer School of Physics at 
Varenna is the first ever held here on the Theory of Information. Let me 
welcome first of all the lecturers and guests who have braved oceans and 
mountain passes to be with us here for the coming two weeks. 

We have a few Italian lecturers and many Italian students here, and I
welcome them as an encouraging sign that one aim of this school is being
achieved, namely to further the growing interest in this new field among
Italian universities.

We hope that this will not turn out to be, all too strictly speaking, a school
composed only of transmitters and receivers of knowledge. There will be
ample opportunity for argument and, we hope, for the building of new bridges
among the representatives of various branches of cybernetics. To achieve this
end, the presence of Professor NORBERT WIENER, to whom we owe more than
just the name of this Science, is especially auspicious. And we are glad that
you have by general acclamation elected him as Permanent Chairman of our
meetings.

Most meetings are much more work to prepare than this one has been,
which could avail itself of the smooth-running organization of the Summer
School of Physics at Varenna of the Società Italiana di Fisica. Our special
thanks go to Professor GIOVANNI POLVANI, President of the Italian Physical
Society, whose untiring efforts have created this physicist's paradise.

Last, certainly not least, let us thank the Ministero Italiano della Pubblica
Istruzione, the Società Italiana di Fisica, and the Massachusetts Institute
of Technology, without whose magnificent sponsorship this meeting would
not have been possible.


Cybernetics, of which the Theory of Information constitutes the mathematical
core, has only recently acquired the character and name of an autonomous
science. Its object is the investigation of the relations existing among the
various parts of an organism, leaving aside the constructional details of each
element, which is characterized exclusively by its function.

If we call « anatomy » of an organism (be it a machine, an industrial
society, or a living being) the study of its detailed constitution, we may then
call its cybernetic study its « physiology »; or, by another analogy, we may
compare the passage from structural to cybernetic study with the transition
from Arithmetic to Algebra.

Just as, when particular numbers in mathematical relations are replaced by
algebraic symbols, general properties previously unsuspected are discovered, so
very different organisms turn out to function identically, and mathematical
methods for the direct study of functional relations are being progressively
developed.

The functioning of a complex organism requires that each of its constituents
have knowledge of the behaviour of the other constituents; communications
among the various parts of a system are therefore of fundamental interest, and
their study, the Science of Communications, is regarded from time to time,
depending on the point of view, either as an essential part of Cybernetics or
as actually coinciding with it.

What is communicated is information; making this concept quantitatively
precise is the starting point of the Theory of Information, which develops the
mathematical apparatus needed for cybernetic studies.


It is thus clear that the field open to such investigations is extremely vast.
To choose a few examples at random, questions such as the most efficient use
of a transmission line, the invention of electronic computers or of machines
able to prove theorems of Logic, the mechanical translation of languages, and
the quantitative analysis of nervous phenomena may give an idea of the almost
unlimited variety of topics that Cybernetics proposes to treat with a unified
method.

While all the other sciences are becoming ever more differentiated and
specialized, and are creating mutually incomprehensible languages, Cybernetics
presents itself as an attempt at synthesis, a bridge thrown between many
branches of knowledge, a discipline that seeks to understand, connect, and
coordinate theories and facts belonging to one or another discipline, through
the discovery of common functions in objects of the most diverse nature.
Depending on the interest of the researcher, it may be understood as a
refinement of the principles of electronic engineering or, rising by degrees, as
a new scientific humanism that studies the behaviour of human communities, or
the genesis and simulation of thought, on strictly quantitative bases.

These researches are developing enormously in the most scientifically
advanced countries, the United States and Russia; governments and industries
support them in every way, and it is easy to foresee that before long they will
lead to an industrial revolution at least as far-reaching as the first, which
substituted the work of the machine for muscular work: intellectual work « of a
non-creative kind » will be carried out by the new cybernetic machines (as is
already happening in several fields).

Italy has so far been totally absent from this activity of research and
application: a circumstance that is particularly painful, since many of these
investigations can be carried out at minimal expense.

The Course, which I have the honour of directing, was conceived and organized
in the hope that it may mark the beginning, in our country too, of serious study
and research in this field, and of active collaboration at both the national
and the international level.

It therefore has a particular character: as the President of the Italian
Physical Society has just mentioned in his opening speech, instead of the usual
distinction between lecturers, students, and observers, and the usual numerical
ratio among them, we have in this Course only two groups, equally numerous
(about thirty persons each): the first consists largely of scientists of high
international reputation, gathered here from many parts of the world; the
second of participants, most of whom may be called students only because they
wish to learn from the former the foundations of a new science.

This is, then, partly a Conference and partly a School: the afternoon hours
will be devoted mainly to original communications, discussions, and seminars;
the morning hours to lectures of an institutional character. This
Course-Conference is also, within cybernetic studies, the first initiative so
conceived and carried out; and the quality of the scholars who have honoured us
with their presence, together with the splendid tradition now attached to the
name of the Varenna Summer School, is a sure omen of its success.


LECTURES


A Descriptive Introduction 
to the Statistical Theory of Communication. 


B. McMILLAN 
Systems Engineering Division, Bell Telephone Lab. - Murray Hill, N.J. 


Introduction. 


As an introduction to the subject of this Course I propose to attempt 
a classification of some of the problems considered in the theory of commu- 
nication and to relate these problems to each other, and to the problems 
considered in conventional statistics. 

Three domains of discourse will be considered, each briefly: 


1) Statistical inference,
2) Communication as related to measurement and control,
3) Communication as a service (e.g. telegraphy).


The modern approach to statistical inference began to develop in, roughly, 
1900. An important stimulus to its growth has been the need of experimenters 
in agriculture for tools to handle highly variable data. It suffices for this 
introduction to divide the problems considered under this heading into two 
classes: 


A: Testing hypotheses,
B: Estimation of parameters. 


In fact, these divisions are neither exhaustive nor mutually exclusive. 

The typical problem of testing (A, above) is illustrated by the following
agricultural experiment: several plots of ground, as nearly alike as possible,
are selected and prepared for planting. Half of them are treated with
fertilizer, the other half not treated. They are then planted and cultivated
alike. When the crops are harvested, the yields from each plot are determined.
Given the several yields $y_1, y_2, \ldots, y_n$ from the treated plots, and the
yields $z_1, z_2, \ldots, z_m$ from the untreated plots, the statistician is then
asked to test, i.e. to accept or reject, the so-called « null hypothesis »: the
hypothesis that the fertilizer has no effect. His method, of course, is to reject
the hypothesis if the $y$'s appear « too different » from the $z$'s; the critical
degree of difference is determined by the internal variations among the $y$'s,
and among the $z$'s.
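
The testing procedure just described can be sketched in code. The example below is only an illustration: it uses a permutation test, which is one convenient way of judging whether the two groups of yields are « too different » relative to their internal variation, and is not claimed to be the method the lecture has in mind. The yields, the 0.05 threshold, and the name `permutation_test` are all hypothetical.

```python
import random
from statistics import mean

def permutation_test(treated, untreated, n_resamples=5000, seed=0):
    """Estimate a two-sided p-value for the null hypothesis "no effect".

    The labels "treated"/"untreated" are shuffled at random; the p-value is
    the fraction of shuffles whose absolute difference of group means is at
    least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(treated) - mean(untreated))
    pooled = list(treated) + list(untreated)
    n = len(treated)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        if abs(mean(pooled[:n]) - mean(pooled[n:])) >= observed:
            hits += 1
    return hits / n_resamples

# Hypothetical yields (arbitrary units), invented for illustration only.
treated_yields = [29.1, 31.4, 30.2, 32.0, 30.8, 31.1]
untreated_yields = [27.9, 28.4, 29.0, 27.5, 28.8, 28.1]

p_value = permutation_test(treated_yields, untreated_yields)
reject_null = p_value < 0.05   # conventional threshold, an assumption
```

With these invented data every treated plot outyields every untreated one, so very few relabelings reproduce so large a difference and the null hypothesis is rejected.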

Underlying the statistical treatment of this experiment is a mathematical 
model which can be diagrammed thus: 


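[Fig. 1. Block diagram. Upper line: Phenomenon, linked by an assumed causal
relation to the Instrument or experiment, which yields Observations. Second
line: Duplicate instrument or experiment, which yields its own observations.
Both lines feed Comparison and decision.]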


This diagram symbolizes the fact that the phenomenon of interest is ob- 
served through an intervening instrument or experiment with which it is pre- 
sumed to be causally related. In general the instrument has two important 
defects which prevent the observations it develops from representing exactly 
the phenomenon being studied. In the first instance, it is typical that the 
instrument distorts, i.e., presents an image of the phenomenon under study
which is not an exact copy and even may be so distorted that some features 
of the original phenomenon are destroyed. That is, it may be that some 
features of the original cannot be recovered or determined from even a perfect 
knowledge of the image. This defect does not directly interest the statistician, 
though it may be of great importance to the experimenter, or to one studying 
the theory of the instrument itself. 

The second defect of the instrument is that its output, the observation, 
does not in general even represent perfectly the distorted image referred to 
above. In fact, the output is in general not related in a completely causal way 
to the input. That is, there are perturbations in the observations which are 
unpredictable in detail, vary from one observation to the next, and can be 
understood, if at all, only in terms of the statistical laws which govern them. 
It is now fashionable to call these perturbations « noise». To ameliorate the 
effect of noise is the task of the statistician. 

In the diagram above, a duplicated experiment is indicated. This is one 
in which the phenomenon of interest is absent by design. The resulting ob- 
servations then calibrate the instrument and, in a statistical sense at least, 
« calibrate » the noise. Statistical inference or decision results by comparison 
of the two kinds of observation. 


The upper line of the diagram above also represents a mathematical model
for the second kind of statistical problem, that of estimating parameters
(B, above). In this kind of problem, the phenomenon of interest is the value,
$\theta$, of some numerical parameters. It is assumed, or known from the theory
of the instrument, that the observation $Y$ is governed by a known statistical
law depending in a known way upon the, still unknown, value of $\theta$, thus


(1)    $\text{Probability}\{Y \le y\} = F(y; \theta)$,


where for each $\theta$, $F(y; \theta)$ is a distribution function of $y$.

From observations of $Y$, one wishes to estimate a value of $\theta$.

In structure, this problem is not so different from that of testing
hypotheses as may at first seem. In the first place, the second line of the
diagram above is, in a sense, still present. The knowledge obtained from these
« calibrating » experiments is already incorporated in (1), i.e., somehow or
other, and ultimately presumably by observation, one has discovered the
dependency indicated in (1) between observation $Y$ and parameter $\theta$. In the
second place, determining a suitable value for $\theta$ is a limiting case, as
$N \to \infty$, of testing simultaneously the succession of « neighbouring »
hypotheses:


$$\text{hypothesis } k:\quad \theta = \frac{k}{N},$$


where e.g. $-N^2 \le k \le N^2$.
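
The remark that estimation is a limiting case of testing a fine grid of neighbouring hypotheses can be made concrete. The following is a minimal sketch, assuming that hypothesis $k$ asserts $\theta = k/N$ with $-N^2 \le k \le N^2$ and, purely for illustration, that the observations are Gaussian with mean $\theta$ and unit variance. The data and the name `best_hypothesis` are hypothetical.

```python
def best_hypothesis(observations, N):
    """Among the hypotheses theta = k/N, -N*N <= k <= N*N, return the value
    of theta with the largest Gaussian log-likelihood (unit variance)."""
    def log_likelihood(theta):
        # Additive constants that do not depend on theta are dropped.
        return -0.5 * sum((y - theta) ** 2 for y in observations)

    candidates = [k / N for k in range(-N * N, N * N + 1)]
    return max(candidates, key=log_likelihood)

observations = [1.9, 2.4, 2.2, 1.7, 2.3]      # hypothetical data
sample_mean = sum(observations) / len(observations)

# As N grows the grid becomes finer and wider, and the selected hypothesis
# approaches the usual estimate (here the sample mean).
estimates = {N: best_hypothesis(observations, N) for N in (2, 10, 100)}
```

For each $N$ the selected value lies within half a grid step, $1/(2N)$, of the sample mean, which is the sense in which the grid of tests turns into an estimate.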

The real difference between problems A and B, testing and estimation, 
lies in the measures of success used by the statistician. Somewhat loosely, 
one may say that in testing hypotheses the statistician is interested in how 
often he accepts the correct hypothesis. All failures to accept the correct 
hypothesis are equally distressing to him. A characteristic of problems of 
estimation is that, in general, the parameter being estimated has a quanti- 
tative meaning, large errors are considered to be more serious than small ones, 
and one attempts to estimate, not exactly (since this may be impossible), but 
in such a way as to minimize some numerical measure of error or of average 
error. 

A practical characteristic of most methods of statistical inference is that 
they are designed for situations in which there are relatively few basic data, 
in particular, for problems in which there are only a finite number of random 
variables. A further characteristic of great theoretical importance is that they 
are also designed for situations in which the statistician cannot, a priori, assume 
any statistical laws for the underlying phenomenon. For example, in the
agricultural experiment described above, it is unlikely that he would be justified
in assuming that « the probability that this fertilizer will have no effect is .3 ».
Similarly, in many problems of estimation it is not valid to assume that the
parameter $\theta$ is itself a random variable.


Some preliminaries, and a description of Shannon’s theory.


A proper formulation of the theory of information requires the mathematical
apparatus of probability theory. Even a careful descriptive account
requires enough of the terminology to justify its introduction here.
Probability begins with an exhaustive listing of all of the « elementary events »
which are assumed to be possible. Let $W$ denote the totality of these elementary
events. In a simple example, $W$ consists of six elementary events: these are
the six possible outcomes in the throw of a single die. It is convenient to call
$W$ a « space », and its elements $w$ the points of that space. An event is either
a point $w$, or a collection (set) of such points. Thus, in the throw of a die, the
occurrence of 1 on the upper face is a « point » of the relevant $W$; it is also
an event. Other events are, for example, i) $W$ itself (the occurrence of
something), or ii) the occurrence of an even number. It is convenient to call the
vacuous subset of $W$ also an event.

In all but the simplest problems, it becomes necessary to restrict the events, 
the subsets of W, for which one calculates probabilities. Theory, however, 
is useless if there are too few events for which probabilities can be stated. 
The proper compromise is to consider a family F of events which has three 
properties of completeness: 

i) $W$ itself is in $F$;
ii) if $E$ is in $F$, then $W - E$ is in $F$ ($W - E$ is called the complement
of $E$; it consists of all points of $W$ which are not in $E$);
iii) if $E_1, E_2, E_3, \ldots$ are in $F$, then the union
$E_1 \cup E_2 \cup E_3 \cup \cdots$ is in $F$ (the union consists of all points
of $W$ which are in some $E_n$, $n = 1, 2, 3, \ldots$).


A collection $F$ of subsets of $W$ having these three properties is called a
Borel field. A Borel field $F$ can be described somewhat loosely, thus: if $F$
contains $E_1, E_2, \ldots$, then it also contains any event $E$ which can be
described in terms of $E_1, E_2, \ldots$ by means of a logically well formulated
sentence.

The final element of structure is probability itself. This is a function $P\{E\}$
defined for all events $E$ in $F$ with the three properties:


i) $P\{W\} = 1$;
ii) for $E$ in $F$, $0 \le P\{E\} \le 1$;
iii) if $E_1, E_2, \ldots$ are in $F$ and if, when $i \ne j$, $E_i$ and $E_j$
contain no points in common, then


$$P\{E_1 \cup E_2 \cup \cdots\} = P\{E_1\} + P\{E_2\} + \cdots$$


(in other words, $P\{E\}$ is additive over mutually exclusive events).
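
For the die example these definitions can be checked exhaustively. The sketch below is illustrative only: it takes $F$ to be the family of all subsets of $W$ (in a finite space this is a Borel field, with unions necessarily finite) and assumes a fair die. The names `power_set` and `P` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

W = frozenset(range(1, 7))            # sample space: the faces of one die

def power_set(space):
    """All subsets of a finite space, as frozensets."""
    items = sorted(space)
    return {frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)}

F = power_set(W)                      # the family of events: all 64 subsets

def P(event):
    """Probability for a fair die (an assumption): each face has weight 1/6."""
    return Fraction(len(event), len(W))

even = frozenset({2, 4, 6})           # the event "an even number occurs"
one = frozenset({1})                  # the event "1 on the upper face"

# Properties of the field of events.
contains_space = W in F
closed_under_complement = all((W - E) in F for E in F)
closed_under_union = all((E1 | E2) in F for E1 in F for E2 in F)

# Properties of the probability.
normalized = P(W) == 1
bounded = all(0 <= P(E) <= 1 for E in F)
additive = all(P(E1 | E2) == P(E1) + P(E2)
               for E1 in F for E2 in F if not (E1 & E2))
```

Exact fractions are used instead of floating-point numbers so that the additivity check is an equality and not an approximation.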

A random variable $x$ is a numerically valued function $x(w)$ defined on
$W$. More precisely, it is a function for which such descriptions as « the
probability that $x \le a$ » are meaningful. That is, for each number $a$ the set of
$w$ such that $x(w) \le a$ must be in $F$. This is a necessary restriction, but not
a serious one. It is convenient also to allow $x(w)$ to be undefined at some
points $w$ provided the probability of the set of such $w$ is 0.

For events $E$ and $B$ in $F$, the conditional probability of $E$, given $B$,
denoted by $P\{E \mid B\}$, is by definition


$$P\{E \mid B\} = \frac{P\{E \cap B\}}{P\{B\}},$$

when $P\{B\} \ne 0$. It is not definable in this form when $P\{B\} = 0$. In this
definition $E \cap B$ denotes the intersection of $E$ and $B$, the set of all $w$
which are both in $E$ and in $B$. Therefore $P\{E \mid B\}$ is the relative fraction
of the probability in $B$ which is covered by points which are also in $E$.

Let $y$ be a random variable which assumes only finitely many distinct
values. That is, there are numbers $b_1, b_2, \ldots, b_n$ such that the sets
$B_i$, described by

$$B_i = (\text{all } w \text{ such that } y(w) = b_i),$$


exhaust $W$. Then given an event $E$, for each $B_i$ such that $P\{B_i\} \ne 0$ (and
there must be at least one such, since $1 = \sum_i P\{B_i\}$) the conditional
probability $P\{E \mid B_i\}$ is defined. This quantity is now a function of $i$,
i.e., a function of $y$. We use the symbol $P\{E \mid y\}$ to denote this quantity.
For each $E$ it is a random variable, namely, a numerical function of $w$ which
takes, for each $w$ in the set $B_i$, the value $P\{E \mid B_i\}$. Since the $B_i$
exhaust $W$, $P\{E \mid y\}$ is undefined at most over a set of probability 0 (this
set being the union of all $B_i$ such that $P\{B_i\} = 0$). Notice that
$P\{E \mid y\}$ is not a general random variable; it is a function of $y$.
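
The construction of $P\{E \mid y\}$ as a random variable can be followed step by step on the die. This is a small sketch under stated assumptions: a fair die, an invented random variable `y` taking two values, and the event « even face »; none of these particular choices comes from the text.

```python
from fractions import Fraction

W = range(1, 7)                                   # fair die (an assumption)
prob = {w: Fraction(1, 6) for w in W}

def P(event):
    """Probability of a set of points of W."""
    return sum((prob[w] for w in event), Fraction(0))

def y(w):
    """A hypothetical random variable with two values: 0 on {1,2,3}, 1 on {4,5,6}."""
    return 0 if w <= 3 else 1

E = {w for w in W if w % 2 == 0}                  # the event "even face"

# The sets B_i = (all w such that y(w) = b_i) exhaust W.
blocks = {}
for w in W:
    blocks.setdefault(y(w), set()).add(w)

# P{E | B_i} = P{E and B_i} / P{B_i}, defined whenever P{B_i} != 0.
cond_given_block = {b: P(E & B) / P(B) for b, B in blocks.items() if P(B) != 0}

def P_E_given_y(w):
    """P{E|y} as a function on W: constant on each set B_i."""
    return cond_given_block[y(w)]

# Averaging P{E|y} over W recovers P{E} (the law of total probability).
average = sum((prob[w] * P_E_given_y(w) for w in W), Fraction(0))
```

The function `P_E_given_y` depends on `w` only through `y(w)`, which is the sense in which $P\{E \mid y\}$ is a function of $y$ and not a general random variable.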

In a similar way, one can define $P\{E \mid y_1, y_2, \ldots, y_n\}$ given several
discrete random variables. From this definition one can then pass by limiting
operations to a general $P\{E \mid y_1, y_2, \ldots\}$, where the $y$'s may be
infinite in number, and indeed need not be discrete. In all cases, for fixed $E$,
$P\{E \mid y_1, \ldots\}$ is a random variable which is, apart perhaps from a set of
events having probability zero, a function of the indicated conditioning
variables.

Turning now specifically to the probability theory needed for a description
of the theory of information, let $A$ be an alphabet, i.e., a finite list of
symbols $A_1, A_2, \ldots, A_N$. Let $W$ be the collection of all infinite sequences
$w = (\ldots, x_{-1}, x_0, x_1, x_2, \ldots)$, where each term $x_t$ in the sequence
is a letter drawn from $A$. Consider a set $C$ of sequences described in the
following way: $C$ consists of all sequences $w$ such that
$x_{t_1} = l_1, x_{t_2} = l_2, \ldots, x_{t_k} = l_k$, where the $l_1, \ldots, l_k$ are
specified letters, not necessarily distinct, of $A$. For convenience, call a set
$C$ described in this way a cylinder set. Let $F$ be the smallest Borel field of
subsets of $W$ which contains all cylinder sets. Once an alphabet $A$ is given the
probability space $W$ is defined and it is a theorem that the Borel field $F$ is
then unique. The dependence of $W$ and $F$ upon $A$ will sometimes be indicated
by a subscript.

It is also a theorem that to specify a probability for sets of $F$ it suffices
to define $P\{C\}$ for cylinder sets $C$. More exactly, if $P\{C\}$ is a true
function of cylinder sets $C$, in that $P\{C\}$ is independent of the mode of
describing $C$, and if $0 \le P \le 1$, then the domain of definition of $P$ can be
extended by limiting operations to all sets of $F$, and $P$ then defines a
probability there.

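In saying that $P\{C\}$ shall be independent of the mode of describing $C$, we
mean the two things illustrated by the following examples:


$$P\{x_1 = l_1 \text{ and } x_2 = l_2\} = P\{x_2 = l_2 \text{ and } x_1 = l_1\},$$


$$P\{x_1 = l_1 \text{ and } x_2 = l_2\} = \sum_{i=1}^{N} P\{x_1 = l_1 \text{ and } x_2 = l_2 \text{ and } x_3 = A_i\}.$$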


The structure just described, an alphabet $A$ and a probability defined on
the resulting $F_A$, constitutes an information source.

A source is said to be stationary if, for each cylinder set and each
$t = 0, \pm 1, \pm 2, \ldots$,


$$P\{x_{t_1} = l_1, x_{t_2} = l_2, \ldots, x_{t_k} = l_k\} = P\{x_{t_1 + t} = l_1, x_{t_2 + t} = l_2, \ldots, x_{t_k + t} = l_k\}.$$


That is, the joint distribution of any $x_{t_1}, \ldots, x_{t_k}$ is the same as that
of the corresponding $x_{t_1 + t}, \ldots, x_{t_k + t}$, and is independent of
absolute time.
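
Stationarity can be illustrated with a simple source that is not discussed in the text: a two-letter Markov chain. The sketch below is an assumption-laden illustration. It uses a one-sided chain started at time 0 in place of the two-sided sequences of the definition, considers only cylinder sets on consecutive times, and the transition table `T` and all names are hypothetical.

```python
# Transition probabilities T[i][j] = P{next letter = j | current letter = i}.
T = {"a": {"a": 0.9, "b": 0.1},
     "b": {"a": 0.5, "b": 0.5}}

def step(dist):
    """Advance the one-letter distribution by one time unit."""
    return {j: sum(dist[i] * T[i][j] for i in T) for j in T}

def cylinder_probability(start_dist, t, letters):
    """P{x_t = letters[0], x_{t+1} = letters[1], ...} for a chain whose
    distribution at time 0 is start_dist (t >= 0)."""
    dist = dict(start_dist)
    for _ in range(t):
        dist = step(dist)
    p = dist[letters[0]]
    for prev, nxt in zip(letters, letters[1:]):
        p *= T[prev][nxt]
    return p

stationary = {"a": 5 / 6, "b": 1 / 6}    # solves pi = pi T for this T
skewed = {"a": 0.0, "b": 1.0}            # a non-stationary starting law

word = "aab"
shifted_stationary = [cylinder_probability(stationary, t, word) for t in range(5)]
shifted_skewed = [cylinder_probability(skewed, t, word) for t in range(5)]
```

Started from `stationary`, the probability of the word is the same at every shift `t`; started from `skewed`, it changes with `t`, so that source is not stationary.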

Consider an alphabet $B$, in addition to $A$. An encoder is a sequence $\Phi_n$,
$n = 0, \pm 1, \pm 2, \ldots$, of functions from $W_A$ into $B$ with the three
properties:


i) for each $w$, $\Phi_n(w)$ is a letter of $B$;
ii) for each $n$, $\Phi_n(w) = \Phi_n(x_n, x_{n-1}, x_{n-2}, \ldots)$ (that is,
$\Phi_n$ does not depend upon the letters of $w$ which occur after $t = n$);
iii) given $\Phi_n(w)$ for all $n$, $w$ is uniquely determined.


Many encoders considered in information theory are not stationary, but are
periodic in the sense that for some fixed $N$


$$\Phi_k(l_1, l_2, \ldots) = \Phi_{k+N}(l_1, l_2, \ldots)$$


for all $k$ and for all sequences $l_1, l_2, \ldots$ of letters of $A$.
A channel may be thought of as a generalization of an encoder. Different
versions of Shannon’s fundamental restricted theorem require different
definitions of channel, but the general idea can be illustrated thus: let $A$ and
$B$ be alphabets and for convenience now denote $W_A$ by $X$, $W_B$ by $Y$, $F_A$ by
$F_X$, $F_B$ by $F_Y$. Consider also the alphabet $AB$ whose letters are the
ordered pairs $(A_i, B_j)$ where $A_i$ is a letter of $A$, $B_j$ a letter of $B$.
$W_{AB}$ consists of all infinite sequences
$\ldots, (\alpha_{-1}, \beta_{-1}), (\alpha_0, \beta_0), (\alpha_1, \beta_1), \ldots$,
where $x = \ldots, \alpha_{-1}, \alpha_0, \alpha_1, \ldots$ and
$y = \ldots, \beta_{-1}, \beta_0, \beta_1, \ldots$ are respectively drawn from $X$
and $Y$. Again, we write $XY$ for $W_{AB}$, and $F_{XY}$ for the corresponding Borel
field. In the most general sense, a channel is a function of probability measures
on $F_X$ with values which are probability measures on $F_{XY}$.

More precisely, all channels considered in the literature are of this form:
given $P$, defined on $F_X$, there is a $Q_P$ defined on $F_{XY}$ with three basic
properties:


i) if $P$ is stationary, $Q_P$ is also;
ii) if $E$ is an event in $F_X$, then $Q_P\{E\} = P\{E\}$;
iii) if $C$ is a cylinder set in $F_Y$ which depends only upon $\beta_n$ for
$n \le t$, then

$$Q_P\{C \mid D_1\} = Q_P\{C \mid D_2\}$$

for any two cylinder sets $D_1$, $D_2$ in $F_X$ which specify the same values for
those $\alpha_n$ for which $n \le t$.


Condition iii) may be illustrated by the following example:

$$Q_P\{\beta_1 = B_j \mid \alpha_1 = A_i \text{ and } \alpha_2 = A_k\} = Q_P\{\beta_1 = B_j \mid \alpha_1 = A_i\}.$$


In other words, a channel is a consistent means for inducing a joint
distribution $Q_P$ on input and output, given any initial distribution $P$ on the
input alone, and the distribution of the output cannot anticipate the input.
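
A very special case makes the idea of a channel as a map from input distributions to joint distributions tangible. The sketch below assumes a memoryless channel acting on one letter at a time, a binary symmetric channel with an illustrative crossover probability of 0.1; this restriction, and all the names, are assumptions and not part of the general definition.

```python
def joint_distribution(p_input, channel):
    """Induced joint law: Q{alpha = a and beta = b} = P{alpha = a} * channel[a][b]."""
    return {(a, b): p_input[a] * channel[a][b]
            for a in p_input for b in channel[a]}

def input_marginal(joint):
    """Sum the joint law over the output letter to recover the input law."""
    marginal = {}
    for (a, _b), q in joint.items():
        marginal[a] = marginal.get(a, 0.0) + q
    return marginal

# channel[a][b] = probability that input letter a is received as b.
bsc = {0: {0: 0.9, 1: 0.1},
       1: {0: 0.1, 1: 0.9}}

P_in = {0: 0.2, 1: 0.8}                  # any input distribution will do
Q = joint_distribution(P_in, bsc)
recovered = input_marginal(Q)            # agrees with P_in on input events
```

Because each output letter here depends only on the input letter at the same instant, the requirement that the output cannot anticipate the input holds trivially; channels with memory need the fuller definition.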

Further elements of structure internal to the channel must be specified 
before Shannon’s fundamental theorem can be proved, but the definition just 
given is adequate for a discussion of that theorem. 

The fundamental quantity of Shannon’s theory is a number, called information
rate, associated with each stationary information source. For completeness
we state one form of its definition, but it suffices for the present
discussion merely to know that this number exists. Given $P$ defined on $F_X$, by
definition the information rate of the resulting source is


$$H(X) = \text{Average}\left[\sum_{i=1}^{N} -\log P\{x_0 = A_i \mid x_{-1}, x_{-2}, \ldots\}\right].$$


Now given a source and a channel, there are three information sources
for which rates can be calculated: the input, the output, and the joint source
whose sequences are those of $XY$. The corresponding rates are $H(X)$ above,


$$H(Y) = \text{Average}\left[\sum_{j} -\log Q_P\{\beta_0 = B_j \mid \beta_{-1}, \beta_{-2}, \ldots\}\right],$$


and

$$H(X, Y) = \text{Average}\left[\sum_{i}\sum_{j} -\log Q_P\{\alpha_0 = A_i \text{ and } \beta_0 = B_j \mid \alpha_{-1}, \beta_{-1}, \alpha_{-2}, \beta_{-2}, \ldots\}\right],$$


where all averages are taken with respect to the probabilities $Q_P$.

It is a theorem that the quantity $R(X, Y) = H(X) + H(Y) - H(X, Y)$, called
the rate from $X$ to $Y$, is always $\ge 0$. This quantity is a function both of the
channel and of $P$. Its maximum, as $P$ is varied over all stationary measures
on $F_X$, is called the capacity of the channel. For certain restricted kinds of
channels, then, Shannon’s fundamental theorem asserts the following:

Given a channel of capacity $C$ and a source of rate $H = H(X)$, if $H < C$,
there exists, in a certain limiting sense, an encoder and decoder such that if
the text from the source is encoded before presentation to the channel, the
decoded output of the channel recreates the text without error.

The «limiting sense » in which this theorem holds is this: in the limit of 
perfect reception the delay between output and input is infinite, and the 
encoder is one with an infinite number of distinct functions ®,. This limiting 
case is approached as one insists on smaller and smaller probabilities of error 
in reception. 


The Statistical Theory of Information.


1. — Introduction.


I propose to discuss the foundations of the statistical theory of infor- 
mation to which Dr. McMILLAN referred as the theory of telegraphic commu- 
nication. This theory, which was formulated by SHANNON in 1948[1], provi- 
des a mathematical framework for the quantitative study of communication 
processes. The three main topics that I shall cover are best described in 
terms of the model of communication system illustrated in Fig. 1. 


[Figure: block diagram. Source → first encoder → second encoder → channel (subject to random disturbance) → first decoder → second decoder → receiver. A dotted line encloses the second encoder, the channel and the first decoder.]


Fig. 1. — Schematic model of communication system. 


The message output from the source may be a printed text, a picture, a 
time function representing the acoustic wave produced by a speaker, the output 
from a digital computer, etc. The purpose of the first encoder is to represent 
any such message in a standard form, such as a sequence of binary digits. It 
seems reasonable to require such a representation to be economical in the 
sense of using, on the average, as few binary digits as possible. 

The purpose of the second encoder is intimately related to the character- 
istics of the channel, and, in particular, to the fact that the channel is always 
subjected to random disturbances. We shall assume, for the purposes of our
discussion, that the channel is discrete in the sense that it accepts as an input
only symbols belonging to a specified finite set, such as the letters of the Latin 
alphabet, and makes available, as an output, symbols belonging also to a spec- 
ified finite set, not necessarily identical to the first one. Because of random 
disturbances in the channel the output symbol is not uniquely specified by 
the input symbol; rather only the probabilities of the different output symbols 
are functions of the input symbols. Thus, in general, the input symbols cannot 
be identified with certainty on the basis of the output symbols, and errors 
will inevitably be committed when any such identification is attempted. The 
function of the second encoder is to transform the standard representation 
of the input message into a sequence of symbols which reduces as much as 
possible the chance of making an erroneous identification of the message after 
it has been transmitted through the randomly disturbed channel. We shall 
see that, under certain conditions, it is possible to make the probability of 
erroneous message identification as small as desired. The function of the first 
decoder is to identify the sequence of symbols input to the second encoder on 
the basis of the corresponding sequence of symbols output from the channel. 
Thus, ideally, the output from the first decoder should be identical to the 
input to the second encoder. Finally, the function of the second decoder is 
to reconstruct the original message, on the assumption that its standard rep- 
resentation was correctly identified by the first decoder. 

The objective of the overall communication system is to make available 
to the ultimate receiver a correct replica of the input message, and to accom- 
plish this efficiently, in the sense of using on the average as few channel symbols 
as possible. It is clear that, in order to be able to speak about the efficiency 
of the system we must be able to characterize quantitatively what is trans- 
mitted through the system, as well as the capability of the specified channel 
to transmit it. That is, we must define a suitable measure of the extent to
which a symbol output from a «black box» specifies the corresponding input
symbol, whether the black box be the specified channel or a coding device to
be designed. The first topic in this series of lectures is the definition of such
a measure and the study of its elementary properties; we shall refer to the
object of this «measure» as the information provided by one symbol about another
symbol.

The measure that we shall define has a great deal of intuitive appeal. We
must be very careful, however, to avoid regarding this measure as a fundamental
one, just because its properties check so well with our intuitive notion
of «information». Its importance, which is indeed great, stems from the two
fundamental theorems stated by C. E. SHANNON in 1948 and further refined
by SHANNON himself and others since that time.

The first theorem concerns the operation performed by the first encoder.
It states, roughly speaking, that the number of binary digits required, on the
average, to represent a message is equal to the average amount of information
that must be provided about a message belonging to a specified ensemble in
order to identify it uniquely. This theorem constitutes the second main topic
in this series of lectures.

The second fundamental theorem concerns the operations performed by 
the second encoder and by the first decoder. It states that under certain con- 
ditions it is possible to encode and decode messages in such a way that the 
probability of erroneously identifying the message transmitted becomes arbi- 
trarily small. This can be accomplished as long as the average amount of 
information provided about the message by each symbol input to the channel 
is smaller than the information capacity of the channel; that is, smaller than 
the average amount of information that can be provided by each output symbol 
about the corresponding input symbol. This theorem constitutes the third 
main topic to be discussed in these lectures. 

Because of time limitations we shall confine our discussion to discrete 
channels, although actual physical channels are more closely represented by 
continuous models. It will suffice to say here that all the important properties 
of discrete channels can be generalized to continuous channels. 


2. — Definition and elementary properties of a measure of information. 


Let $x_1, x_2, \ldots, x_k, \ldots, x_{m_x}$ be the points of a discrete space X and
$y_1, y_2, \ldots, y_i, \ldots, y_{m_y}$ be the points of a discrete space Y. We shall denote with x
the variable representing a point of the first space and with y the variable
representing a point of the second space. A probability P(x, y) is defined over
the product space XY, consisting of the points representing all possible
pairs x, y. We may think, for instance, of x as the symbol input to the channel
of Fig. 1, and of y as the corresponding output symbol. We may also think
of x as the message generated by the source, and of y as one of the symbols
output from the first encoder. Our objective is to define a suitable measure
of the information provided by the event $y = y_i$ about the event $x = x_k$.

Let us consider this question from the point of view of how well the event
$y = y_i$ can identify the event $x = x_k$ in the eyes of a person who can observe
only the first event. The «a priori» probability of the event $x = x_k$ is:


(1)   $P(x_k) = \sum_{Y} P(x_k, y)$,


where the summation extends over all the values of y. The «a posteriori»
probability that the event $x = x_k$ has occurred, that is, its probability
conditioned by $y = y_i$, is:


(2)   $P(x_k/y_i) = \frac{P(x_k, y_i)}{P(y_i)} = \frac{P(x_k, y_i)}{\sum_{X} P(x, y_i)}$,


where the summation extends over all the values of x. Clearly the effect of
the observation of the event $y = y_i$ is to change the probability of the event
$x = x_k$ from $P(x_k)$ to $P(x_k/y_i)$. It seems reasonable, therefore, to require that
the information provided by the occurrence of the event $y = y_i$ about the
occurrence of the event $x = x_k$ (for short, provided by $y_i$ about $x_k$) be a function
of these two probabilities. Let us denote this measure of information by $I(x_k; y_i)$.
The first requirement on the measure of information to be defined is:


a) $I(x_k; y_i) = F(\varphi, \psi)$, where $F(\varphi, \psi)$ is a once differentiable function of $\varphi$
and $\psi$, and $\varphi = P(x_k)$, $\psi = P(x_k/y_i)$.


We observe, on the other hand, that if the two events involved depend
on the occurrence of a third event which has already been observed, consistency
requires the a priori and a posteriori probabilities involved in the measure
of information to be conditioned by this third event. Let then $z_1, z_2, \ldots, z_j, \ldots, z_{m_z}$
be the points of a third space Z represented by a discrete variable z, and assume
that a probability P(x, y, z) is defined over the product space XYZ of all
triplets x, y, z. Denoting by $I(x_k; y_i/z_j)$ the information provided by $y_i$ about $x_k$
when $z_j$ is given, we require:


b) $I(x_k; y_i/z_j) = F(\varphi, \psi)$, where $\varphi = P(x_k/z_j)$, $\psi = P(x_k/y_i, z_j)$.


Let us consider next the information $I(x_k; y_i, z_j)$ provided by the pair of
events $y = y_i$ and $z = z_j$ about the event $x = x_k$. It seems reasonable to require
that the measure of this information be the same whether we regard the two
events $y = y_i$ and $z = z_j$ as observed simultaneously as a pair or individually
in succession. Thus we require that


c) (3)   $I(x_k; y_i, z_j) = I(x_k; y_i) + I(x_k; z_j/y_i)$.


Let us consider finally two additional discrete variables $\xi$ and $\eta$, statistically
independent of x, y:


(4)   $P(x, y, \xi, \eta) = P(x, y)\, P(\xi, \eta)$.


Because of this statistical independence, the information provided by the
events $y = y_i$, $\eta = \eta_n$ about the events $x = x_k$, $\xi = \xi_l$ should be independent
of whether we regard the events of each pair as separate events or as forming
a single composite event. Thus we must require:


d) (5)   $I(x_k, \xi_l; y_i, \eta_n) = I(x_k; y_i) + I(\xi_l; \eta_n)$.


It can be shown that the four conditions stated above are sufficient to
specify uniquely a measure of information. We obtain for this measure


(6)   $F(\varphi, \psi) = \log \frac{\psi}{\varphi}$.


Thus, for instance, the amount of information provided by $y_i$ about $x_k$ is
given by


(7)   $I(x_k; y_i) = \log \frac{P(x_k/y_i)}{P(x_k)}$.


The derivation of this result is similar to that given by P. M. WOODWARD [2].
The base of the logarithm is arbitrary, and affects only the size of the unit
of information. For reasons that will become evident later, the base-two
logarithms are most often used, and the name «bit» is employed to denote the
unit of information. We shall follow this convention, and all logarithms will
be understood to be base two unless otherwise stated.

This measure of information has a number of interesting properties. In
the first place, multiplying the numerator and the denominator in Eq. (7) by
$P(y_i)$ yields


(8)   $I(x_k; y_i) = \log \frac{P(x_k, y_i)}{P(x_k)\, P(y_i)} = \log \frac{P(y_i/x_k)}{P(y_i)} = I(y_i; x_k)$.


Thus the information provided by $y_i$ about $x_k$ is equal to the information
provided by $x_k$ about $y_i$. In other words, the measure is symmetrical in the
two events. For this reason it is convenient to refer to it as the «mutual
information» between the two events. Clearly, it is a measure of the extent
to which the two events are more likely to occur together than if they were
statistically independent.

The mutual information $I(x_k; y_i)$ becomes a maximum for a fixed $P(x_k)$
when $P(x_k/y_i) = 1$, that is, when $x_k$ is uniquely specified by $y_i$. This maximum
value is equal to


(9)   $I(x_k) = -\log P(x_k)$.


We shall refer to $I(x_k)$ as the «self information» of the event $x = x_k$. It
represents the amount of information that must be provided about this event
by some other event in order that the former be uniquely specified. Conversely,
if we regard the event $x = x_k$ as providing information about some other event,
its self information represents the maximum amount of information that it
can provide about the other event. In general


(10)   $I(x_k; y_i) \le \begin{cases} I(x_k) \\ I(y_i) \end{cases}$


It is important to observe that $I(x_k; y_i)$ may be negative as well as positive.
It is negative when the probability of occurrence of the pair of events
$x = x_k$, $y = y_i$ is smaller than if the variables x and y were statistically
independent. On the other hand the average value (expectation) of $I(x_k; y_i)$ over
the set of events X can be shown to be non-negative, that is,


(11)   $I(X; y_i) = \sum_{X} P(x/y_i)\, I(x; y_i) \ge 0$.


This result checks with our intuitive notion that, if x and y are statistically
related, the event $y = y_i$ must provide on the average a positive amount of
information about x.
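
As a numerical illustration of Eqs. (7) to (11), the following Python sketch evaluates the mutual information of each pair of events for a small joint distribution. The distribution `P_XY` and all function names are assumptions introduced here for illustration; they are not part of the lectures.

```python
import math

# Hypothetical joint distribution P(x, y); the numbers are illustrative only.
P_XY = {("x1", "y1"): 0.4, ("x1", "y2"): 0.1,
        ("x2", "y1"): 0.1, ("x2", "y2"): 0.4}

def marginal(joint, axis):
    """Marginal distribution of x (axis 0) or of y (axis 1)."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

P_X, P_Y = marginal(P_XY, 0), marginal(P_XY, 1)

def mutual_info(x, y):
    """I(x; y) = log2[P(x, y) / (P(x) P(y))], in bits (Eq. 8)."""
    return math.log2(P_XY[(x, y)] / (P_X[x] * P_Y[y]))

def self_info(x):
    """I(x) = -log2 P(x), in bits (Eq. 9)."""
    return -math.log2(P_X[x])

def average_info_about_x(y):
    """Average of I(x; y) over X, weighted by P(x/y) (Eq. 11)."""
    return sum(P_XY[(x, y)] / P_Y[y] * mutual_info(x, y) for x in P_X)

for x, y in P_XY:
    print(f"I({x}; {y}) = {mutual_info(x, y):+.4f} bits, "
          f"I({x}) = {self_info(x):.4f} bits")
print("average over X given y1:", round(average_info_about_x("y1"), 4))
```

With these numbers the pairs (x1, y2) and (x2, y1) occur less often than they would under independence, so their mutual information is negative, while the average over X for either observed y remains non-negative.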

The average value of the self-information associated with x is given by


(12)   $H(X) = \sum_{X} P(x)\, I(x) = -\sum_{X} P(x) \log P(x)$.


Because of the form of this expression, it is customary to refer to it as the 
entropy of the ensemble of events X. It can be interpreted as the amount of 
information that must be provided, on the average, to identify one of the 
events of the ensemble. It can also be interpreted as the maximum amount 
of information that can be provided on the average by one of the events of the 
ensemble X about some other event. 

The entropy H(X) is a function of the probability distribution P(x). It
can be readily shown that it becomes a maximum when


(13)   $P(x) = \frac{1}{m_x}$


for all the $m_x$ events of the set, that is, when the events are equiprobable. The
maximum value is:


(14)   $H(X)_{\max} = \log m_x$.
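
The maximum property of Eqs. (13) and (14) can be checked numerically. The sketch below is a minimal illustration; the two distributions and the function name `entropy` are assumptions chosen for the example.

```python
import math

def entropy(probs):
    """H = -sum p log2 p, in bits; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

m = 4                                  # number of events in the ensemble
uniform = [1 / m] * m                  # equiprobable events (Eq. 13)
skewed = [0.7, 0.1, 0.1, 0.1]          # hypothetical non-uniform distribution

print("uniform:", entropy(uniform), "bits; log2(m) =", math.log2(m))
print("skewed: ", round(entropy(skewed), 4), "bits")
```

The equiprobable ensemble attains log m, and any departure from equiprobability lowers the entropy.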

The conditional entropy


(15)   $H(Y/X) = \sum_{X} \sum_{Y} P(x, y)\, I(y/x) = -\sum_{X} \sum_{Y} P(x, y) \log P(y/x)$


represents the average amount of information that must be provided in order
to specify the event y when the event x is known. The average amount of
information that must be provided in order to specify both events is given by the
joint entropy:


(16)   $H(X, Y) = \sum_{X} \sum_{Y} P(x, y)\, I(x, y) = -\sum_{X} \sum_{Y} P(x, y) \log P(x, y)$.


We obtain with the help of Eqs. (12) and (15)


(17)   $H(X, Y) = H(X) + H(Y/X)$.


It can be shown, in addition, that


(18)   $H(Y/X) \le H(Y)$;


in words, the amount of information that must be provided, on the average,
in order to specify y can only be decreased by the knowledge of x.
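
Relations (17) and (18) can be verified on any joint distribution. The following sketch does so for one hypothetical distribution; `P_XY` and the helper names are illustrative assumptions.

```python
import math

# Hypothetical joint distribution P(x, y); illustrative values only.
P_XY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def entropy(probs):
    """-sum p log2 p, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginal(joint, axis):
    """Marginal distribution of x (axis 0) or of y (axis 1)."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

def conditional_entropy_y_given_x(joint):
    """H(Y/X) = -sum P(x, y) log2 P(y/x), with P(y/x) = P(x, y)/P(x) (Eq. 15)."""
    px = marginal(joint, 0)
    return -sum(p * math.log2(p / px[x])
                for (x, _), p in joint.items() if p > 0)

H_X = entropy(marginal(P_XY, 0).values())
H_Y = entropy(marginal(P_XY, 1).values())
H_XY = entropy(P_XY.values())                 # joint entropy (Eq. 16)
H_Y_given_X = conditional_entropy_y_given_x(P_XY)

print("H(X) + H(Y/X) =", round(H_X + H_Y_given_X, 6), " H(X, Y) =", round(H_XY, 6))
print("H(Y/X) =", round(H_Y_given_X, 6), " H(Y) =", round(H_Y, 6))
```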


3. — The first fundamental theorem.


We saw that the function of the first encoder in Fig. 1 is to provide a 
representation of the input message in the form of a sequence of symbols 
selected from a specified alphabet. The character of this encoding operation 
depends, among other things, on the form in which the message is generated 
by the source. The essence of the problem, however, can be stated in simple 
terms as follows. 

Let us consider an ensemble of M messages, $u_1, u_2, \ldots, u_k, \ldots, u_M$, with
corresponding transmission probabilities $P(u_1), \ldots, P(u_M)$. Let us suppose, also,
that each message must be represented, for transmission purposes, by means
of a sequence of symbols (code words) selected from an alphabet with D symbols,
where D < M. We wish to inquire about the minimum number of symbols
required, on the average, to specify one of the messages of the ensemble.

It was pointed out in the preceding section that the maximum amount 
of information that can be provided on the average by one symbol is equal 




360 R. M. FANO 


to log D, the capacity of the alphabet, and that this maximum amount is 
actually provided when the symbols of the alphabet occur with equal probabilities.
This suggests that we should construct the code words in such a way
that at each position in them the different symbols of the alphabet will occur 
equiprobably, and independently of the preceding symbols. The significance 
of this statement can be best understood in terms of the following examples 
of binary code words. 

The 8 messages in Fig. 2 have the same transmission probability $P(u) = 2^{-3}$.
The first binary digit (*) is 0 for the first four code words, and 1 for the remaining
four code words. Thus the probability that the first digit be 0 is exactly equal
to the probability that the first digit be 1; in other words, the two sets of
messages separated by the first digit are equally probable. The second digit
divides again the message space into two equiprobable sets, with 0 being
assigned to messages $u_1, u_2, u_5, u_6$, and 1 being assigned to messages $u_3, u_4, u_7, u_8$;
furthermore, these two sets of messages intersect the two sets formed by the
first digit in such a way as to yield four equiprobable subsets, namely
$u_1, u_2$; $u_3, u_4$; $u_5, u_6$; $u_7, u_8$. Thus the probabilities that the second digit be
a 0 and that it be a 1 are not only equal, but also independent of the first
digit. The third digit divides once more the message space into 2 equiprobable
sets, which in turn divide each of the above four subsets into two equiprobable
parts, each consisting of a single message. Thus the third digit too is
independent of the preceding digits as well as equiprobable. Since the three
successive digits are used at full capacity, they must provide together an amount
of information equal to 3 binary units (bits); this is just equal to the entropy
of the message ensemble, that is to the amount of information that must be
provided on the average to identify a message of the ensemble.


    Messages      Code words

    u_1           000
    u_2           001
    u_3           010
    u_4           011
    u_5           100
    u_6           101
    u_7           110
    u_8           111

    (1st division between u_4 and u_5; 2nd division between the pairs within
    each half; 3rd division between the two messages of each pair.)

Fig. 2. — Optimum set of code words for equiprobable messages.


(*) The use of the term «digit» is restricted hereafter to the binary case.
The term «symbol» is used in the general case.


    Messages      Probabilities      Code words

    u_1           0.25               00
    u_2           0.25               01
    u_3           0.125              100
    u_4           0.125              101
    u_5           0.0625             1100
    u_6           0.0625             1101
    u_7           0.0625             1110
    u_8           0.0625             1111


Fig. 3. — Optimum set of code words.


In the case of Fig. 3, the messages are no longer equiprobable, but their
probabilities are still of the form


(19)   $P(u_k) = 2^{-n_k}$,


where $n_k$ is an integer. Again the first binary digit divides the message space
into two equiprobable sets, which, however, do not contain the same number
of messages. The second digit divides each of these two sets into two
equiprobable subsets. Two of the resulting subsets contain a single message, while
the other subsets are further divided by the third and fourth digit into
equiprobable parts until all messages are singled out.

It is clear by inspection that each digit is independent of all preceding
digits and that it may be a 0 or a 1 with equal probabilities. Furthermore
all digits of a code word are uniquely specified by the corresponding message,
so that the mutual information between each digit and the message is equal
to the self information of the digit. Thus each digit is used at full capacity,
that is, it provides as much information as possible about whatever message
is being transmitted. As a matter of fact, since each digit, whether a 0 or a 1,
occurs with probability 1/2, it provides in all cases exactly one unit of
information about the corresponding message. It follows that the sum of the self
informations of the individual digits of each code word must be equal to the
number of digits in the code word. On the other hand this sum must also
be equal to the self information of the corresponding message, because each
code word is uniquely specified by the corresponding message. We may
conclude, therefore, that the number of digits in each code word must be equal
to the self information of the corresponding message, that is to the integer
$n_k$ in Eq. (19). This is actually the case in Fig. 3, as can be readily checked
by inspection. Furthermore, the integers satisfy the equation


(20)   $\sum_{k=1}^{M} 2^{-n_k} = \sum_{k=1}^{M} P(u_k) = 1$.


The average number of symbols per code word can be readily computed.
We have


(21)   $N = \sum_{k=1}^{M} P(u_k)\, n_k = 2.75 = -\sum_{k=1}^{M} P(u_k) \log P(u_k)$,


which is just the entropy of the message ensemble. This result is in agree- 
ment with the intuitive notion that, since each digit is used at full capacity 
(it contributes one unit of information) the average number of digits per code 
word must be equal to the amount of information required, on the average, 
to identify a message (the entropy of the message ensemble). The same rea- 
soning suggests also that the set of code words illustrated in Fig. 3 is optimum, 
in the sense that no set of code words could yield a smaller value for the 
average number of digits per code word. This is actually the case, as shown 
below. 
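
The arithmetic behind Eq. (21) can be reproduced with a short Python sketch using the entries of Fig. 3. The last code word (1111) is illegible in the source and is assumed here from the pattern of the table; the dictionary layout and the `decode` helper are likewise illustrative assumptions.

```python
import math

# Probabilities and code words of Fig. 3. The word for u8 (1111) is an
# assumption inferred from the pattern of the table.
FIG3 = {
    "u1": (0.25, "00"),     "u2": (0.25, "01"),
    "u3": (0.125, "100"),   "u4": (0.125, "101"),
    "u5": (0.0625, "1100"), "u6": (0.0625, "1101"),
    "u7": (0.0625, "1110"), "u8": (0.0625, "1111"),
}

# Average number of digits per code word and entropy of the ensemble (Eq. 21).
avg_len = sum(p * len(w) for p, w in FIG3.values())
H = -sum(p * math.log2(p) for p, _ in FIG3.values())

def decode(bits):
    """Split a digit string into messages, reading one digit at a time.
    This works because no code word is a continuation of a shorter one."""
    lookup = {w: name for name, (_, w) in FIG3.items()}
    out, word = [], ""
    for b in bits:
        word += b
        if word in lookup:
            out.append(lookup[word])
            word = ""
    return out

print("average length:", avg_len, "entropy:", H)
print(decode("00" + "1101" + "100"))
```

Each word length equals the self information of its message, which is why the average length coincides with the entropy.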

The idea of constructing code words by successive divisions of the message
space into equiprobable sets and subsets applies to the case of an arbitrary
alphabet as well as to that of a binary alphabet; one must only divide into D
equally probable subsets instead of just two. The technique fails, however, 
as one would expect, when the message probabilities are not negative powers 
of D (the number of symbols in the alphabet) because it becomes impossible 
at some point to make equiprobable divisions. Still, one may try to make 
the divisions as equiprobable as possible, thereby hoping to keep the average 
number of symbols per message as small as possible. 

Upper and lower bounds on the minimum average number of code symbols 
per message can be obtained with the help of the following theorem due to 
L. KRAFT. Before this theorem can be stated, however, we must discuss an 
important condition that must be satisfied by the code words assigned to the 
messages. It is obvious that in order for each message to be uniquely specified 
by the corresponding code word, no two code words can be identical. Further- 
more, no code word can be a continuation of a shorter code word; for instance 
the code words 001 and 0010 cannot be present in the same set. This 
follows from the fact that two such code words can be distinguished only
because empty spaces are present in one where symbols are present in the
other. Basing the identification of the code words on such a difference would
be equivalent to using an empty space as an additional symbol, thereby
increasing the size of the coding alphabet. We can now state the theorem
mentioned above.

Let $n_1, n_2, \ldots, n_k, \ldots, n_M$ be a prescribed set of M positive integers. The
inequality


(22)   $\sum_{k=1}^{M} D^{-n_k} \le 1$


is a necessary and sufficient condition for the existence of a set of M different
code words employing an alphabet with D symbols, whose lengths are equal
to the prescribed integers, and none of which is a continuation of a shorter one.
The proof of this theorem can be found in reference [3]. 
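
The sufficiency half of the theorem can be made concrete with a small constructive sketch. The procedure below (assign words in order of increasing length, counting upward in base D) is one standard construction and is not claimed to be the proof given in reference [3]; all function names are assumptions introduced for illustration.

```python
from fractions import Fraction

def kraft_sum(lengths, D=2):
    """Left-hand side of inequality (22), computed exactly."""
    return sum(Fraction(1, D ** n) for n in lengths)

def to_base(value, D, width):
    """Write a non-negative integer as `width` base-D digits (D <= 10 assumed)."""
    digits = []
    for _ in range(width):
        value, r = divmod(value, D)
        digits.append(str(r))
    return "".join(reversed(digits))

def prefix_code(lengths, D=2):
    """Build code words with the given lengths, none continuing a shorter one."""
    if kraft_sum(lengths, D) > 1:
        raise ValueError("inequality (22) is violated: no such code exists")
    words, value, prev = [], 0, None
    for n in sorted(lengths):
        if prev is not None:
            # Step past the previous word, then extend to the new length.
            value = (value + 1) * D ** (n - prev)
        words.append(to_base(value, D, n))
        prev = n
    return words

def is_prefix_free(words):
    """True if all words differ and none is a continuation of another."""
    return len(set(words)) == len(words) and not any(
        a != b and b.startswith(a) for a in words for b in words)

print(prefix_code([2, 2, 3, 3, 4, 4, 4, 4]))   # the word lengths of Fig. 3
print(prefix_code([1, 1, 2, 2, 2], D=3))       # a ternary example
```

For the lengths of Fig. 3 the construction returns the same eight words as the figure; lengths such as 1, 1, 2 with D = 2 violate the inequality and are rejected.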

The average number of symbols per message is, by definition,


(23)   $N = \sum_{k=1}^{M} P(u_k)\, n_k$.


We wish to find upper and lower bounds to the minimum value of N for any
given message ensemble. A lower bound can be readily found by relaxing the
condition that the lengths of the code words be integers. Under these
conditions we can minimize the right-hand side of Eq. (23) with respect to the
variables $n_1, n_2, \ldots, n_k, \ldots, n_M$, subject to the constraint imposed by Eq. (22). The
resulting minimum value is clearly a lower bound for N. We obtain for the
optimum value of $n_k$


(24)   $n_k = \frac{-\log P(u_k)}{\log D} = \frac{I(u_k)}{\log D}$,


and for the desired lower bound


(25)   $N_{\min} = \sum_{k=1}^{M} P(u_k)\, \frac{I(u_k)}{\log D} = \frac{H(U)}{\log D}$.


It is interesting to note that this bound is equal to the entropy of the mes- 
sage ensemble divided by the capacity of the alphabet, as suggested by the 
above examples. Furthermore, the optimum length of the code word corre- 
sponding to each message is just equal to the self information of the message 
divided by the capacity of the alphabet. Thus if the self information of each 
message happens to be an integral multiple of the alphabet capacity, the 
optimum word lengths are integers, and a corresponding optimum set of code 
words is insured by the theorem stated above. For instance, in Figs. 2 and 3
all the message self-informations were integers, and, as a result, the average
number of binary digits per message could be made equal to the entropy of
the message ensemble. Thus our intuition was correct in suggesting that the
sets of code words of Figs. 2 and 3 were optimum.

An upper bound to the required average number of symbols per message
can be obtained as follows. We observe, first of all, that the only reason why
it is not possible in general to make the average number of symbols per message
equal to the optimum value given by Eq. (25) is that the ratios $I(u_k)/\log D$ are
not usually equal to integers. It would seem reasonable, in such cases, to
make each $n_k$ equal to the integer $n_k^*$ equal to, or just larger than, the
corresponding ratio, so that


(26)   $\frac{I(u_k)}{\log D} \le n_k^* < \frac{I(u_k)}{\log D} + 1$.


On the other hand this inequality implies


(27)   $\sum_{k=1}^{M} D^{-n_k^*} \le \sum_{k=1}^{M} D^{-I(u_k)/\log D} = \sum_{k=1}^{M} P(u_k) = 1$.


Thus, in view of the theorem stated above, it is always
possible to construct a set of code words with the lengths specified by the
inequality (26). The resulting average number of symbols per message is given by


(28)   $N = \sum_{k=1}^{M} P(u_k)\, n_k^* < \frac{H(U)}{\log D} + 1$.


Finally combining this upper bound with the lower bound given by Eq. (25)
yields


(29)   $\frac{H(U)}{\log D} \le N < \frac{H(U)}{\log D} + 1$,


where N is the required average number of symbols per message.
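
The bounds of Eq. (29) can be checked numerically by choosing the word lengths as in Eq. (26). The sketch below is a minimal illustration for one hypothetical ensemble; the probabilities and the function name `shannon_lengths` are assumptions. The lengths are found by exact integer search rather than by rounding a floating-point logarithm.

```python
import math
from fractions import Fraction

def shannon_lengths(probs, D=2):
    """Smallest integers n with D**(-n) <= P(u_k), i.e. n >= I(u_k)/log D (Eq. 26)."""
    lengths = []
    for p in probs:
        n = 0
        while Fraction(p) * D ** n < 1:
            n += 1
        lengths.append(n)
    return lengths

probs = [0.4, 0.3, 0.2, 0.1]     # hypothetical message probabilities
D = 2                            # size of the coding alphabet

lengths = shannon_lengths(probs, D)
kraft = sum(Fraction(1, D ** n) for n in lengths)                  # Eq. (27)
H_over_logD = -sum(p * math.log(p) for p in probs) / math.log(D)   # lower bound (25)
N = sum(p * n for p, n in zip(probs, lengths))                     # Eq. (28)

print("lengths:", lengths, " Kraft sum:", kraft)
print(f"{H_over_logD:.4f} <= {N:.4f} < {H_over_logD + 1:.4f}")
```

Since the Kraft sum does not exceed one, code words with these lengths exist, and their average length falls between the two bounds.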

The per cent difference between the upper bound and the lower bound
becomes negligibly small for large values of H(U)/log D. Let us suppose, for
instance, that the messages consist of segments of sequences of independent
x-symbols, belonging to ensembles having the same entropy H(X). The entropy
of the message ensemble consisting of all possible segments of length n is then
equal to


(30)   $H(U) = n H(X)$.


Then, substituting this equation for H(U) in Eq. (29) and dividing by n
yields


(31)   $\frac{H(X)}{\log D} \le \frac{N}{n} < \frac{H(X)}{\log D} + \frac{1}{n}$.


It follows that the minimum average number of code symbols per message
symbol can be made as close as desired to H(X)/log D by making the message
length n sufficiently large. Essentially the same result is obtained when the
successive x-symbols constituting a message are not statistically independent.
The above result constitutes the first fundamental theorem. Its importance 
lies in the fact that it provides a direct operational meaning for the entropy 
of a message ensemble as the minimum number of binary digits required on the 
average to encode a message. We might also say, from a broader point of 
view, that the theorem provides substantial evidence for the soundness of the 
postulates on which the statistical theory of information is based. 
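
The convergence stated by the theorem can be observed numerically. The following sketch, offered only as an illustration, encodes blocks of n independent binary symbols with word lengths equal to the self information of the block rounded up to an integer, and reports the average number of binary digits per source symbol. The source probability `p1` and the function names are assumptions.

```python
import math

def binary_entropy(p1):
    """H(X) in bits for a binary symbol with P(1) = p1, 0 < p1 < 1."""
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

def digits_per_source_symbol(p1, n):
    """Average binary digits per source symbol when each block of n
    independent symbols gets a word of length ceil(I(block)) (D = 2)."""
    total = 0.0
    for k in range(n + 1):                    # k = number of ones in the block
        p_block = p1 ** k * (1 - p1) ** (n - k)
        length = math.ceil(-math.log2(p_block))
        total += math.comb(n, k) * p_block * length
    return total / n

p1 = 0.1                                      # hypothetical source statistics
for n in (1, 2, 4, 8, 16):
    print(n, round(digits_per_source_symbol(p1, n), 4),
          "entropy:", round(binary_entropy(p1), 4))
```

For every block length the result lies between H(X) and H(X) + 1/n, so the excess over the entropy is squeezed toward zero as n grows.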


4. — The second fundamental theorem. 


We saw in the preceding section that it is possible to represent the mes- 
sages of a given ensemble by means of sequences of symbols in such a way 
that each successive symbol is used at close to full capacity. This implies 
that the output of the first encoder can be regarded for all practical purposes 
as a sequence of independent, equiprobable symbols. For the sake of simplicity 
we shall assume, in what follows, that these symbols belong to a binary 
alphabet, and refer to them as binary digits. Since they can be regarded as 
independent and equiprobable, each particular sequence of n such digits will 
occur with probability $2^{-n}$.

The second fundamental theorem in its simplest form concerns the problem 
of transmitting sequences of independent, equiprobable digits through a discrete 
channel with specified characteristics. Thus it involves the part of the system 
of Fig. 1 enclosed by a dotted line, namely the second encoder, the channel, 
and the first decoder. The goal is, of course, to reproduce correctly at the 
output of the first decoder the sequence of binary digits input to the second 
encoder, and employ for this purpose as few channel symbols as possible. 

Let us consider first the transmission properties of the channel. If we 
represent with x the symbol input to the channel and with y the corresponding 
output symbol, the channel is defined by the set of conditional probabilities 
P(y/x). We shall limit our discussion to stationary channels without memory,
that is, to channels for which the values of P(y/x) are fixed and independent
of the preceding input and output symbols. 

Let us denote by P(x) the probability of the symbol x. The average amount
of information provided by y about x is obtained by averaging the mutual
information between x and y over all pairs x, y. We obtain


(32)   $I(X; Y) = \sum_{X} \sum_{Y} P(x, y) \log \frac{P(x/y)}{P(x)}$.


This expression can also be written in terms of entropies in the three alternate
forms


(33)   $I(X; Y) = H(X) + H(Y) - H(X, Y)$,

(34)   $I(X; Y) = H(X) - H(X/Y)$,

(35)   $I(X; Y) = H(Y) - H(Y/X)$.


Furthermore, it can be readily shown that


(36)   $I(X; Y) \ge 0$.


The right-hand sides of Eqs. (33), (34) and (35) provide three different
interpretations for the average value of the mutual information. According to
Eq. (33), it can be interpreted as the difference between the average amount
of information necessary to specify x and y separately (as if they were
independent) and the amount necessary to specify them together as a pair. In
other words, I(X; Y) is a measure of the statistical constraint between x
and y; it vanishes when x and y are statistically independent. On the other
hand, Eq. (34) indicates that I(X; Y) is the difference between the average
amounts of information necessary to specify x before and after the reception
of y. The conditional entropy H(X/Y) is often referred to as the «equivocation»
because it represents the uncertainty about x that remains after the reception
of y. Finally, Eq. (35) expresses I(X; Y) as the difference between the average
amount of information that y is capable of providing, and the amount
necessary to specify the channel disturbance. The conditional entropy H(Y/X) is
sometimes referred to as the «noise entropy».
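
The equivalence of the direct average and the three entropy forms (33), (34) and (35) can be confirmed numerically. The sketch below uses one hypothetical joint distribution; `P_XY` and all names are illustrative assumptions.

```python
import math

# Hypothetical joint distribution P(x, y) of channel input and output.
P_XY = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.15, (1, 1): 0.35}

def entropy(probs):
    """-sum p log2 p, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

P_X, P_Y = {}, {}
for (x, y), p in P_XY.items():
    P_X[x] = P_X.get(x, 0.0) + p
    P_Y[y] = P_Y.get(y, 0.0) + p

# Direct average of the mutual information; P(x/y)/P(x) = P(x, y)/(P(x) P(y)).
I_direct = sum(p * math.log2(p / (P_X[x] * P_Y[y])) for (x, y), p in P_XY.items())

H_X, H_Y, H_XY = entropy(P_X.values()), entropy(P_Y.values()), entropy(P_XY.values())
equivocation = -sum(p * math.log2(p / P_Y[y]) for (x, y), p in P_XY.items())   # H(X/Y)
noise_entropy = -sum(p * math.log2(p / P_X[x]) for (x, y), p in P_XY.items())  # H(Y/X)

forms = {
    "Eq. (33)": H_X + H_Y - H_XY,
    "Eq. (34)": H_X - equivocation,
    "Eq. (35)": H_Y - noise_entropy,
}
print("direct average:", round(I_direct, 6))
for name, value in forms.items():
    print(name, round(value, 6))
```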

Our first objective is to determine the information capacity per channel
symbol. We observe, in this regard, that in general, each output symbol
provides information not only about the corresponding input symbol, but
also about all the preceding input symbols. It can be shown, however, that,
for given symbol transmission probabilities, the average information provided
by each received symbol is a maximum when successive transmitted symbols
are statistically independent. Thus, in computing the information capacity
of the channel, successive symbols can be regarded as statistically independent.
It should be stressed, however, that this is true only for channels without
memory, that is, only when the values of the conditional probability P(y/x)
are independent of all preceding symbols.

The average mutual information given by Eq. (32) is a function of the values
of the transmission probabilities P(x) as well as of the characteristics of the
channel, represented by the values of the conditional probability P(y/x). Let


(37)   $C = \max_{P(x)} I(X; Y)$


be the maximum average value of the mutual information between x and y,
obtained by varying the symbol transmission probabilities. In view of the
preceding argument, C is an upper bound to the average amount of information
that can be provided by each received symbol about the corresponding
transmitted symbol and all preceding symbols. We shall refer to C as the channel
capacity.

Let us evaluate, as a simple example, the capacity of a binary symmetric 
channel in which the digits 0 and 1 are received incorrectly with probability p, 
and correctly with probability q, 


(38) $P(y = 0/x = 0) = P(y = 1/x = 1) = q = 1 - p$,


(39) $P(y = 0/x = 1) = P(y = 1/x = 0) = p$.

We obtain for the noise entropy


(40) $H(Y/X) = -\sum_X \sum_Y P(x)\, P(y/x) \log P(y/x) = -p \log p - q \log q$.


Because of the symmetry of the channel this expression is independent 
of the transmission probabilities. It follows from Eq. (35) that the maximum 
value of I(X; Y) can be obtained simply by maximizing H(Y). In our particular 
case, the maximum value of H(Y) is equal to one, and it is achieved when the 
digits 0 and 1 are received with equal probabilities. This implies, in turn, 
that the two digits are also transmitted with equal probabilities. Thus, we 
obtain for the capacity of the binary symmetric channel 


(41) $C = 1 + p \log p + q \log q$.


Base two logarithms are used in accordance with the convention established 
in Sec. 2. Clearly, C is equal to unity for p = 0, that is, when the probability
of incorrect reception is equal to zero; it decreases monotonically with
increasing p and vanishes for p = 1/2, that is, when the received digit is statistically
independent of the transmitted digit.
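
As a numerical illustration, the capacity formula (41) can be evaluated directly. The following is a minimal sketch; the function names `binary_entropy` and `bsc_capacity` are chosen here for illustration and do not come from the lecture.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy, in bits, of a binary digit that equals 1 with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # x*log2(x) tends to 0 as x tends to 0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def bsc_capacity(p: float) -> float:
    """Capacity in bits per channel digit: C = 1 + p log p + q log q."""
    return 1.0 - binary_entropy(p)


if __name__ == "__main__":
    for p in (0.0, 0.001, 0.05, 0.25, 0.5):
        print(f"p = {p:5.3f}   C = {bsc_capacity(p):.4f}")
```

The values fall from 1 at p = 0 to 0 at p = 1/2, as stated above.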

It should be stressed that while the evaluation of the capacity is very 
simple in the case of a binary symmetric channel, it may become very
involved in other cases. The source of difficulty is that the maximization 
of I(X; Y) must be carried out under the constraint that the values of the 
transmission probabilities be non-negative numbers. 

Let us turn our attention next to the encoding and decoding operations 
performed by the second encoder and by the first decoder in Fig. 1. For the 
sake of simplicity, we shall limit most of our discussion to the case of a binary 
symmetric channel. In such a case the encoder and the decoder transform 
sequences of binary digits into other sequences of binary digits. The overall 
objective is to reproduce correctly at the output of the first decoder the sequence 
of binary digits input to the second encoder.

We observe, first of all, that the amount of information necessary to specify
each of the digits input to the second encoder (message digits) is equal to one
unit, in view of the fact that the digits are, by assumption, equiprobable and
statistically independent. On the other hand, the channel capacity is smaller
than unity. It follows that the reception of a channel digit cannot provide
enough information to specify uniquely one message digit. More precisely,
the correct reproduction at the output of the first decoder of $N_1$ message digits
certainly requires the transmission of $N$ channel digits, where


(42) $N \ge \frac{N_1}{C}$.

We shall refer to

(43) $R = \frac{N_1}{N} \le C$


as the rate of transmission per channel digit.

It should be clear that, since there are $2^{N_1}$ possible sequences of $N_1$ message
digits and $2^N$ possible sequences of $N$ channel digits, there is a great deal of
freedom in the assignment of channel sequences to message sequences. The
coding problem consists of the selection of a set of $2^{N_1}$ channel sequences out
of the possible $2^N$ sequences, in such a way as to maximize the probability
that the first decoder will reproduce correctly the message sequence. The
overall operation of the part of the system enclosed by dotted lines in Fig. 1
is illustrated schematically in Fig. 4.

Fig. 4. - Schematic illustration of coding and decoding operations for transmission through
a binary symmetric channel.

The binary channel is shown in Fig. 4 as a pulse communication system
in which a positive pulse corresponds to the digit 1 and a negative pulse to 
the digit 0. The pulse detector at the receiver is considered as part of the 
given channel, and it is assumed that a pulse of either polarity has a proba- 
bility p of being mistaken for a pulse of the opposite polarity. The second 
encoder consists of the storage device and the computer shown in the upper 
part of the figure, while the first decoder consists of the storage device and 
the computer shown in the lower part of the figure. 

The encoding operation is performed as follows. The input message digits
are stored in blocks of $N_1$ digits. These digits are then transformed into an
equal number of corresponding pulses. An additional number $N_2$ of checking
pulses are generated by the computer from the $N_1$ message digits, according
to suitably selected rules. The entire sequence of


(44) $N = N_1 + N_2$

pulses is then transmitted through the channel. In Fig. 4, $N_2$ is equal to $N_1$,
so that the transmission rate per pulse is equal to one half. This particular
method of assigning sequences of pulses to sequences of message digits can 
be shown to be sufficiently general in the sense that it does not limit in any 
substantial manner the overall performance of the system. 

The sequence of $N$ pulses output from the detector will include, in general,
pulses with incorrect polarity. The function of the computer is to select, from
the set of $2^{N_1}$ sequences of pulses that correspond to message sequences, the
one that differs from the received sequence in the least number of pulses.
Once this pulse sequence has been determined, the computer generates the
corresponding message sequence, which constitutes the output of the first
decoder in Fig. 1. It can be shown that this decoding procedure minimizes
the probability that any one of the $N_1$ output message digits be incorrect.
We shall denote with $P_e$ this probability of error.

Unfortunately, our present knowledge of the optimum rules for computing
the additional $N_2$ pulses is very limited. Dr. SLEPIAN will speak (*) in some detail
about this problem. On the other hand, P. ELIAS [4] has been able to compute
upper and lower bounds to the minimum probability of error that can be
achieved in principle. These bounds can be written in the form

(45) $K_1 2^{-N E_1} \le P_e \le K_2 2^{-N E_2}$,

where $E_1$ and $E_2$ are functions of the channel capacity C and of the rate of
transmission R, but are independent of N, and $K_1$ and $K_2$ are quantities that
vary very slowly with N. The quantities $E_1$ and $E_2$ are positive for R < C
and vanish for R = C. Thus it is possible to make the probability of error
as small as desired by increasing the length of the message and channel sequences
while keeping their ratio,

(46) $R = \frac{N_1}{N} < C$,

constant. In other words, it is possible by proper encoding to transmit at
any finite rate smaller than the channel capacity with a vanishingly small
probability of error per message sequence. This is the second fundamental
theorem for the binary symmetric channel.

(*) See this issue, page 373.

Fig. 5. - Upper and lower bounds to the probability of error for a message
consisting of $N_1$ binary digits and encoded into $N_1 + N_2$ binary pulses
(plotted against N for p = 0.05, C = 0.7136, R = 0.5).

The curves shown in Fig. 5 illustrate the behavior with N of the upper and
lower bounds for a channel with p = 0.05 and for a transmission rate R = 0.5.
In this particular case the exponents $E_1$ and $E_2$ are equal, as indicated by
the fact that the two curves become parallel straight lines for large values of N.

The functional dependence of $E_1$ and $E_2$ on the pulse error probability p
that characterizes the channel, and on the transmission rate R, is best
expressed in terms of the parameter r defined by


(47) $R = C(r) = 1 + r \log r + (1 - r) \log (1 - r)$.


This function, whose behavior is illustrated in Fig. 6, becomes identical
with the channel capacity when we set r = p.
Let us define, in addition, the quantity


(48) $T_p(r) = 1 + r \log p + (1 - r) \log (1 - p)$.


This quantity is a linear function of r for a given p and becomes equal to
the channel capacity for r = p,


(49) $T_p(r = p) = C(r = p) = C$.


It is represented in Fig. 6 as a straight line tangent to the curve C(r) at the
point r = p.
The exponent $E_1$ is given by


(50) $E_1 = C(r) - T_p(r)$,

which, in Fig. 6, is the vertical distance between the curve and the tangent
for the value of r for which C(r) is equal to the desired rate R. The expression
for the exponent $E_2$ depends on whether, for the desired rate R, the parameter r
is larger or smaller than the critical value $r_c$
defined by


(51) $\frac{r_c}{1 - r_c} = \sqrt{\frac{p}{1 - p}}$.


For r smaller than $r_c$, $E_2$ is equal to $E_1$,


(52) $E_2 = E_1 = C(r) - T_p(r); \quad r \le r_c$.


On the other hand,


(53) $E_2 = C(r) - [2C(r) + T_p(r_c) - 2C(r_c)] < E_1; \quad r > r_c$.


The curve representing the term in square brackets is tangent to the straight
line representing $T_p(r)$, as illustrated in Fig. 6, so that the curve can be
regarded as a continuation of the straight line for $r > r_c$. Thus the exponent
$E_2$ can be interpreted as the vertical distance between the curve C(r) and the
new curve.

Fig. 6. - Graphical determination of $E_1$ and $E_2$.

It is important to note that, since $E_1 = E_2$ for $r < r_c$, the upper and lower
bounds have in this region the same exponential behavior with N. As a matter
of fact, the proportionality factors $K_1$ and $K_2$ are both proportional to $N^{-1/2}$, so that
their ratio is independent of N. Thus, in this region, the probability of error is
effectively bracketed by the two bounds, as illustrated in Fig. 5. On the other hand,
for $r > r_c$, $E_2 < E_1$ and the two bounds diverge exponentially. Thus the
probability of error is only roughly bracketed by the bounds for low transmission rates.
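
The graphical construction can also be carried out numerically. The sketch below assumes the following reading of Eqs. (47) to (53): C(r) = 1 + r log r + (1 - r) log(1 - r), T_p(r) = 1 + r log p + (1 - r) log(1 - p), the exponent of the lower bound is C(r) - T_p(r), and the critical value r_c is the point where the curve 2C(r) + const is tangent to the line T_p(r), that is r_c/(1 - r_c) = sqrt(p/(1 - p)). The function names are illustrative only.

```python
import math


def xlog2(x: float) -> float:
    """x * log2(x), with the limiting value 0 at x = 0."""
    return 0.0 if x == 0.0 else x * math.log2(x)


def C_of(r: float) -> float:
    """C(r) = 1 + r log r + (1 - r) log(1 - r), base-2 logarithms."""
    return 1.0 + xlog2(r) + xlog2(1.0 - r)


def T_of(r: float, p: float) -> float:
    """T_p(r) = 1 + r log p + (1 - r) log(1 - p): the tangent to C(r) at r = p."""
    return 1.0 + r * math.log2(p) + (1.0 - r) * math.log2(1.0 - p)


def r_for_rate(R: float) -> float:
    """Solve C(r) = R on [0, 1/2] by bisection (C falls from 1 to 0 there)."""
    lo, hi = 0.0, 0.5
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if C_of(mid) > R:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def exponents(p: float, R: float):
    """Return (E1, E2), assuming 0 < p < 1/2 and 0 < R <= C."""
    r = r_for_rate(R)
    r_c = math.sqrt(p) / (math.sqrt(p) + math.sqrt(1.0 - p))  # critical value
    e1 = C_of(r) - T_of(r, p)
    if r <= r_c:
        e2 = e1
    else:
        e2 = C_of(r) - (2.0 * C_of(r) + T_of(r_c, p) - 2.0 * C_of(r_c))
    return e1, e2


if __name__ == "__main__":
    for R in (0.1, 0.3, 0.5, 0.7):
        e1, e2 = exponents(0.05, R)
        print(f"R = {R:.1f}   E1 = {e1:.4f}   E2 = {e2:.4f}")
```

For p = 0.05 and R = 0.5 the solved value of r lies below r_c, so the two exponents coincide, in agreement with the parallel lines of Fig. 5.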

The second fundamental theorem is the cardinal result in the statistical 
theory of information. C. E. SHANNON was the first to show in 1948 [1] that,
in any channel without memory, the probability of error can be made as small 
as desired by proper encoding, for any transmission rate smaller than the 
channel capacity. This result was refined and extended by A. FEINSTEIN 
in 1954[5]. He proved, in particular, that the probability of error can be made 
to vanish exponentially with increasing message length. The exponential 
upper bound on the probability of error for the binary symmetric channel 
was developed independently by P. ELIAS and C. E. SHANNON. More recently,
C. E. SHANNON developed [6] a similar upper bound for arbitrary discrete
channels without memory.


We have had the pleasure of hearing Drs. MCMILLAN and FANO present
in some detail the discrete case of the mathematical theory of information
originally developed by CLAUDE SHANNON. The portion of the theory that
they have discussed is now rather fully developed. The obvious and
important questions have been asked and well answered. There is of
course yet work to be done, extensions and generalizations to be made,
new corners to be explored. But by and large, this portion of the theory is
in good shape. Drs. MCMILLAN and FANO have been kind enough to leave for
me to discuss that part of the theory about which no one knows anything.
I can, therefore, safely feel well qualified to speak.


1. — I shall discuss what is known as «the coding problem» of infor- 
mation theory. Actually there are two very distinct coding problems. One 
concerns information sources and may be called the problem of redundancy 
removal. The other concerns channels and may be called the problem of com- 
batting noise. Both problems can be further subdivided into two cases as shown 
in Table I. 


TABLE I. — The two coding problems. 


              Source                                   Channel
              (remove redundancy)                      (combat noise)

Discrete      Codes of SHANNON, FANO, HUFFMAN,         ?
              SCHUTZENBERGER, etc.

Continuous    Work on speech and television signals    ?


The discrete case of the source coding problem is well in hand now.
The problem is the canonical representation in most efficient form of the 
encoded version of the message given the size of the new alphabet. This 
is the encoding problem that Prof. FANO has spoken about. The entropy 
of the source in bits measures the average number of zeros and ones per letter 
of the original message necessary for this efficient encoding. Coding techniques 
by SHANNON, HUFFMAN and FANO tell us explicitly how to perform the
encoding. Researches by SCHUTZENBERGER, SARDINAS and others have answered
many questions about the nature of the coders. Here our knowledge is rather 
complete. 

It has been mentioned several times that much of the theory presented 
for the discrete case can be extended to continuous messages. When we attempt 
to make this extension for the redundancy removal problem we encounter a 
number of complications of a non-trivial sort. Consider for example an ensemble 
of continuous messages representing speech. It is clear that not all the fine 
detail of the waveforms is pertinent to speech. We feel that the message in
this form has much redundancy, and we naturally seek a minimal encoding
of the speech ensemble into sequences of zeros and ones. But minimal in what
sense? Here is the difficulty. When we consider the efficient representation
of discrete messages by zeros and ones, we require our encodings to be
uniquely decipherable into the original messages without errors. When we encode
speech into sequences of zeros and ones, we require that they be capable of
decipherment into speech sounds in some sense as acceptable to the listener
as the original uncoded version. Our constraints on the minimization problem,
then, are subjective in nature, and until such time as they can be stated in
mathematical form much of the work on efficient encoding of speech must be
of an experimental nature. A great deal of research is going on in laboratories
throughout the world on the redundancy removal encoding problem both
for speech and television. Time does not permit further discussion of
it here.

As regards the second coding problem, that of combatting noise on channels,
our knowledge to date is very fragmentary indeed. The literature on
this subject has grown markedly in the last several years, however, and I
shall have little trouble filling the remainder of my talks with selected findings
from a few of these papers.

The fundamental theorem states that given a channel with capacity C
and an information source producing messages with entropy H, there exists
an encoder and a decoder such that messages from the source can be encoded,
transmitted over the channel, and decoded with an arbitrarily small
probability of error, provided only that H < C. If H > C, it is impossible
to achieve this with any encoder and decoder: the probability of error will be
bounded away from zero and we cannot make it arbitrarily small. This theorem
is an existence theorem. As proved today, it does not show us explicitly how
to construct such encoders and decoders.

From the practical point of view, this theorem contains the golden fruit 
of the theory. It promises us communication in the presence of noise of a 
sort that was never dreamed possible before: perfect transmission at a reason- 
able rate despite random perturbations completely outside our control. It 
is somewhat disheartening to realize that today, ten years after the first state- 
ment of this theorem, its content remains only a promise, that we still do not 
know in detail how to achieve these results for even the most simple non-trivial 
channel. 

In order to understand better the theorem, its proof, and our difficulties
in achieving the results promised by the theorem, I am going to restrict
my attention for most of the remainder of my talks to one particularly
simple channel. The problem for more complicated channels differs only in
detail but not in overall character.

Let us direct our attention to the binary symmetric channel already
discussed by Dr. FANO (see Fig. 1).

Fig. 1. - Binary symmetric channel.

For ease in talking about the channel, we shall suppose it can handle
binary digits at the rate of 1000 per second so that we shall measure its capacity
in bits/second rather than in the somewhat more awkward units bits/binary
digit transmitted. The capacity then is C = 1000 (1 + p log p + q log q) bits/s,
with logarithms taken to base 2. A plot of C versus p is shown in Fig. 2.
If, for example, p = .001, a not unrealistic value, it is found that C = 988,
so that according to the fundamental theorem, we should be able to transmit
messages with any entropy H < 988 bits/second over the channel with as
little error as we desire. How shall this be done?

Fig. 2. - Capacity of binary symmetric channel.

For simplicity, let us take as our message source an experimenter tossing
a fair coin at some fixed rate, i.e., some fixed number of tosses per second.
If we denote heads by one and tails by zero, the experimenter generates a
sequence of binary digits
whose entropy rate in bits/second is equal to the rate at which the coin is
tossed. We wish to transmit the results of this experiment over the channel
to a destination.

If we encode the message digits one by one directly into the channel digits,
we can transmit up to 1000 digits per second, but there will be probability p
that each received digit is in error. One obvious way to decrease this error
probability in digits passed on to the destination is to require the experimenter
to slow down. If, for example, he tosses his coin at the rate of 333 1/3 tosses per
second, we have time to repeat each digit two additional times in
transmission (see Fig. 3).


Fig. 3. - Threefold repetition: the experimenter's message 1 0 1 1 0 1 is
transmitted as 111 000 111 111 000 111; the received sequence may contain
isolated errors, and the decoded sequence is again 1 0 1 1 0 1.


The received digits will in general differ from those transmitted. If we
break up the received digits into blocks of three, however, and decode by the
majority rule « one if more ones than zeros in a block, zero if more zeros than
ones in a block », we will correct all errors in transmission provided only that
two or three errors never occur in one block. An elementary calculation shows
that the error probability in the digits passed on to the destination is now
about $3 \times 10^{-6}$ instead of $10^{-3}$, as would result if the experimenter's
digits were transmitted without repetition.

The error probability can clearly be reduced further by the same technique.
If we require the experimenter to toss the coin at the slow rate of 200 tosses
per second, we have time to repeat each digit 5 times in transmission. If we
break up the received digits into blocks of length 5 and again use a majority
rule, an error probability of about $10^{-8}$ results.
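
The elementary calculation is a binomial tail: a block of n repeated digits is decoded wrongly when more than half of its digits are received in error. A minimal sketch follows, taking p = 0.001 from the example above; the function name `majority_error` is illustrative.

```python
from math import comb


def majority_error(p: float, n: int) -> float:
    """Probability that more than half of n (odd) repeated digits are in error."""
    q = 1.0 - p
    return sum(comb(n, k) * p**k * q ** (n - k) for k in range(n // 2 + 1, n + 1))


if __name__ == "__main__":
    p = 0.001
    for n in (1, 3, 5, 7):
        # With 1000 channel digits per second, n repetitions allow 1000/n tosses per second.
        print(f"n = {n}   tosses/s = {1000 / n:7.1f}   error = {majority_error(p, n):.3e}")
```

Each further pair of repetitions lowers the error probability by several orders of magnitude while the permitted tossing rate falls only as 1/n.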

We can of course carry this scheme further and further. The evident dis- 
advantage is that as we decrease the error probability more and more, we 
require the experimenter to signal more and more slowly. Indeed in the limit 
of zero error probability we require him to stop generating messages alto- 
gether. 

Now the main point of Shannon’s fundamental theorem is precisely that 
this slow down is not necessary to achieve arbitrarily small error probability. 
In the case at hand, it asserts that our experimenter can toss the coin at any 
fixed rate less than 988 tosses per second, and by being sufficiently clever 
we can deliver the results to a destination with as little error as we
choose.

To gain insight into how this might be done, let us suppose the coin is
tossed at the rate 500 tosses/second. We then have time to transmit two
digits on the channel for each digit generated by the experimenter. A little
thought shows that we can gain nothing by sending a pair of digits each time
a coin toss is made. Instead, let us break up the experimenter's digits into
blocks of length two. We have time to replace each block of length two by
a block of length four. We do so by the encoding dictionary shown in Fig. 4.


00 → 0000
10 → 1001
01 → 0111
11 → 1110


Fig. 4.


The left side lists the four possible blocks; the right side the corresponding 
sequence of four digits that is transmitted. At the receiver, the messages are 
broken up into blocks of four digits. Sixteen different blocks are possible. 
They all appear on the left side of the decoding dictionary shown in Fig. 5. 


0000  1000  0100  0010  →  00
1001  0001  1101  1011  →  10
0111  1111  0011  0101  →  01
1110  0110  1010  1100  →  11


Fig. 5.


The right side of the dictionary shows the corresponding decoded message 
that is passed on to the destination. For example, if the received block is 
either 0000, 1000, 0100, or 0010, then 00 is passed on to the destination. 
Study of the first row of this table shows that if 0000 is transmitted it is de- 
coded exactly into 00 not only when no errors occur in transmission but also 
when there is a single error in the first, or the second, or the third digit trans- 
mitted. Precisely the same statement holds for each of the three other allowed 
transmitted blocks of digits, 1001, 0111 and 1110. 

The probability of error in digits given to the destination can be
computed to be approximately 1/2000.
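
This figure can be checked by enumerating all received blocks. The sketch below assumes p = 0.001, equiprobable messages, and the decoding rule described above (each allowed block is recovered when no error occurs or when a single error falls on the first, second or third digit). The names `ENCODE`, `CORRECTABLE` and `DECODE` are illustrative.

```python
from itertools import product

# Encoding dictionary of Fig. 4: message block -> transmitted block.
ENCODE = {"00": "0000", "10": "1001", "01": "0111", "11": "1110"}
# Error patterns the decoder corrects: none, or one error in digit 1, 2 or 3.
CORRECTABLE = ["0000", "1000", "0100", "0010"]


def xor(a: str, b: str) -> str:
    """Digit-by-digit sum modulo 2 of two binary strings."""
    return "".join("1" if x != y else "0" for x, y in zip(a, b))


def differences(a: str, b: str) -> int:
    """Number of places in which two binary strings differ."""
    return sum(x != y for x, y in zip(a, b))


# Decoding dictionary: every received block of four digits -> decoded message.
DECODE = {xor(word, e): msg for msg, word in ENCODE.items() for e in CORRECTABLE}


def digit_error_probability(p: float) -> float:
    """Average fraction of message digits delivered incorrectly."""
    q = 1.0 - p
    total = 0.0
    for msg, word in ENCODE.items():          # the four messages are equiprobable
        for bits in product("01", repeat=4):  # every possible received block
            received = "".join(bits)
            d = differences(word, received)   # number of channel errors
            wrong = differences(msg, DECODE[received])
            total += 0.25 * p**d * q ** (4 - d) * wrong / 2
    return total


if __name__ == "__main__":
    print(f"{digit_error_probability(0.001):.3e}")
```

Under these assumptions the dominant contribution is a single error in the fourth digit, which the dictionary cannot correct and which spoils one of the two message digits.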

We can reduce the error probability further while still permitting the 
experimenter to toss his coin 500 times per second by an obvious generali- 
zation. We break the experimenter’s messages up into blocks of length three. 
There are eight such possible blocks. We prepare a dictionary or code book 
that replaces each of these eight blocks by a suitably chosen sequence of six 
binary digits to be transmitted over the channel. At the receiving end, we 
break the messages up into blocks of length six. There are $64 = 2^6$ possible
blocks that might occur. We prepare a decoding dictionary that tells how to
replace each such block of six digits by a block of three digits. These are
passed on to the destination. With the best such encoding and decoding
dictionary, it can be shown that the error probability is now reduced
to roughly $10^{-5}$.

This procedure can be carried on indefinitely. In the general case, the
message is broken into blocks of length k. Each of these $2^k$ blocks is replaced
by a block of n binary digits for transmission over the channel. The received
message is broken into blocks of length n. A code book tells how to decode
each of these $2^n$ sequences into an appropriate sequence of k digits. The
fundamental theorem and its proof show that by making k and n sufficiently
large and by using appropriate encoding and decoding dictionaries of this
sort, the probability of error can be made as small as desired provided only
that k/n < C.

We do not know how to explicitly construct the dictionaries mentioned 
above. For a given k and n there will clearly be a minimum value of error 
probability obtained. We do not know this value except for very small values 
of k and n. Upper and lower bounds for this best error probability that are 
asymptotic results for large k and n are known, however. These results show 
that the error probability can be made to decrease exponentially with n for 
fixed k/n. From these bounds, it would appear that codes with values of n 
from 50 to 100 would be extremely useful in application. 

However, the code book method described here is clearly out of the question
for such values of n, since it must list $2^n$ entries. If, indeed, such codes are
ever to be used in practice, they must have special features, perhaps an
algebraic structure, which permit coding and decoding by some calculation
technique rather than by dictionary.

The later portion of my talk will be devoted to a discussion of some of the
codes that are known. Before proceeding to these, however, it seems best to
pause in order to prove the fundamental theorem. The proof to follow is an
adaptation by E. N. GILBERT to the binary symmetric channel of a more
general proof due to A. FEINSTEIN. The proof will, I hope, shed some light on
the nature of the coding problem.

We shall need a few mathematical results of a secondary nature which we
now establish as preliminaries in order not to break the chain of argument at
a later point. Suppose we have m distinguishable objects and m different
colors of paint. The number of different ways in which we can paint the objects
is $m^m$. This is certainly greater than the number of ways in which we can
color the objects when we are restricted to use the first k paints on only k of
the objects. This latter number is $\binom{m}{k} k^k (m - k)^{m-k}$, for we can choose the k
objects in $\binom{m}{k}$ ways; once chosen they can be colored with the first k paints
in $k^k$ ways. The remaining m - k objects can be colored with the remaining
m - k colors in $(m - k)^{m-k}$ ways. Thus

(1) $\binom{m}{k} < \frac{m^m}{k^k (m - k)^{m-k}}$.


The next result needed concerns the number w of ones to be found in
a sequence of n binary digits when successive digits are produced independently
with probability p for a one and probability q = 1 - p for a zero. The
probability of finding exactly k ones in such a string is

$\Pr(w = k) = \binom{n}{k} p^k q^{n-k}$.

It is easy then to show that $E\,w = np$ and that $E(w - np)^2 = np(1 - p)$.
Let us now define the number b by


(2) $b = \left[ \frac{2np(1 - p)}{\varepsilon} \right]^{1/2}$,


where $\varepsilon$ is a small number given in advance.
From $E(w - np)^2 = np(1 - p)$ it follows by definition that


$\sum_{k} \Pr(w = k)(k - np)^2 = np(1 - p)$,

so that


$\sum_{k > np + b} \Pr(w = k)(k - np)^2 < np(1 - p)$


or


$b^2 \sum_{k > np + b} \Pr(w = k) < np(1 - p)$.


The sum, however, is the probability that w be greater than np + b.
Combining this result with (2) gives


(3) $\Pr(w > np + b) < \frac{\varepsilon}{2}$.

Those familiar with the Chebychev inequality will recognize this as a special 
case. 
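
The bound can be compared with the exact binomial tail. The following sketch takes b as the square root of 2np(1 - p)/ε, which is the value that makes np(1 - p)/b² equal to ε/2; the sample values n = 200, p = 0.05, ε = 0.1 and the function names are assumptions made for illustration.

```python
from math import comb, sqrt


def chebyshev_b(n: int, p: float, eps: float) -> float:
    """The number b chosen so that n*p*(1-p)/b**2 equals eps/2."""
    return sqrt(2.0 * n * p * (1.0 - p) / eps)


def tail_probability(n: int, p: float, threshold: float) -> float:
    """Exact Pr(w > threshold) when w counts the ones among n independent digits."""
    q = 1.0 - p
    return sum(comb(n, k) * p**k * q ** (n - k) for k in range(n + 1) if k > threshold)


if __name__ == "__main__":
    n, p, eps = 200, 0.05, 0.1
    b = chebyshev_b(n, p, eps)
    exact = tail_probability(n, p, n * p + b)
    print(f"b = {b:.2f}   exact tail = {exact:.2e}   bound = {eps / 2:.2e}")
```

The exact tail is typically far below ε/2; the inequality is crude, but it is all the proof requires.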

A few definitions and another inequality will complete our preliminaries.
We shall be concerned with the $2^n$ n-place binary sequences. We shall refer
to the sequences as points and will denote different sequences or points by
letters such as $x$, $x'$, etc.

We define the distance between two sequences to be the number of places
in which they differ, so that, for example, the distance between 11010
and 01100 is three since these sequences differ in the first, third and fourth
places. By a sphere of radius r about the point x we shall mean the set of
all points distant r or less from x.
We shall use the symbol A(x) to denote the sphere of radius np + b about
x, where b is given by (2). The number of points in such a sphere is


(4) $N_A = \sum_{k=0}^{np+b} \binom{n}{k}$,

since there are exactly $\binom{n}{k}$ points distant k from any given point. When n
is large and p < 1/2, np + b < n/2, so that the largest term in (4) is the one
for k = np + b. Thus


$N_A < \frac{n}{2} \binom{n}{np + b} < \frac{n}{2} \, \frac{n^n}{(np + b)^{np+b} (nq - b)^{nq-b}}$

on using (1), or what is the same


(5) $N_A < \frac{n}{2} \, \frac{1}{(p + b/n)^{np+b} (q - b/n)^{nq-b}}$.
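
The distance, the exact sphere size (4) and the bound (5) are easy to compute for moderate n. In this sketch a non-integral radius is rounded down, and the right side of (5) is taken as (n/2) divided by (p + b/n)^(np+b) (q - b/n)^(nq-b); the function names are illustrative.

```python
from math import comb, floor, sqrt


def distance(x: str, y: str) -> int:
    """Number of places in which two binary sequences differ."""
    return sum(a != b for a, b in zip(x, y))


def sphere_size(n: int, radius: float) -> int:
    """Number of n-place sequences within the given distance of a fixed point."""
    return sum(comb(n, k) for k in range(floor(radius) + 1))


def sphere_bound(n: int, p: float, b: float) -> float:
    """Right side of (5); meaningful when np + b < n/2."""
    q = 1.0 - p
    return (n / 2.0) / ((p + b / n) ** (n * p + b) * (q - b / n) ** (n * q - b))


if __name__ == "__main__":
    print(distance("11010", "01100"))
    n, p, eps = 100, 0.05, 0.1
    b = sqrt(2.0 * n * p * (1.0 - p) / eps)
    print(sphere_size(n, n * p + b), f"{sphere_bound(n, p, b):.3e}")
```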


We are now ready to prove the fundamental theorem for the binary symmetric
channel. We consider transmitting sequences of length n over the
channel. We shall choose a particular subset of the $2^n$ points $x_1, x_2, \ldots, x_{2^n}$
to signal with. We denote these special points by $X_1, X_2, \ldots, X_K$ and call
them code points.

We shall also choose K disjoint sets of points $R_1, R_2, \ldots, R_K$, which will
define our method of decoding. If a received sequence of n digits is in $R_j$,
we shall assert that $X_j$ was transmitted, $j = 1, 2, \ldots, K$.

The sets $R_1, R_2, \ldots, R_K$ are called detection regions.

When a received point does not lie in any detection region we make no
decision and count this as an error.

We now proceed to choose the code points and detection regions. $X_1$ is
chosen arbitrarily. $R_1$ is taken to be the sphere $A(X_1)$. When $X_1$ is transmitted,
the received sequence will lie outside $R_1$ only if more than np + b errors
have occurred in transmission. The errors that occur in transmission can be
represented by a string of ones and zeros. The ones and zeros representing
the errors occur independently with respective probabilities p and q = 1 - p. By
(3), then, the probability that the received message lie outside $R_1$ when $X_1$
is transmitted is less than $\varepsilon/2$, so that the probability that $X_1$ be decoded in
error is less than $\varepsilon/2$.

We now proceed to choose $X_2$. The region $R_2$ will be taken as that
part of $A(X_2)$ that is disjoint from $R_1$. We choose for $X_2$ any point such
that when transmitted the probability that the received point lie in $R_2$ be
greater than $1 - \varepsilon$. The probability that $X_2$ be decoded in error is less than $\varepsilon$.
Thus we have allowed a possible small overlap of the spheres $A(X_1)$ and
$A(X_2)$. The probability that the received point lie in this overlap region when
$X_2$ is transmitted must be less than $\varepsilon/2$.

We proceed in this manner selecting successive code points and detection
regions. The region $R_j$ is the portion of $A(X_j)$ that is disjoint from $R_1, R_2,
\ldots, R_{j-1}$. The point $X_j$ must be chosen so that when $X_j$ is transmitted the
probability that the received point lie in $R_j$ be greater than $1 - \varepsilon$. We
continue in this manner until no new points can be found which qualify as code
points. We suppose that a total of K points have been found. From the
method by which the code points were chosen and the detection regions formed,
it is clear that the probability that any transmitted point be decoded in error
is less than $\varepsilon$. The burden of proving the fundamental theorem, then, will
be in showing that K is sufficiently large. To this we now turn our attention.
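
For very short sequences the selection procedure can be run exhaustively. The sketch below is an illustration under assumed values (n = 7, p = 0.05, ε = 0.2), with candidate points examined in lexicographic order and the sphere radius taken as np + b with b the square root of 2np(1 - p)/ε; it is not meant as a practical code construction.

```python
from itertools import product
from math import sqrt


def hamming(x, y) -> int:
    """Number of places in which two sequences differ."""
    return sum(a != b for a, b in zip(x, y))


def transition_prob(x, y, p: float) -> float:
    """Pr(y received | x sent) on a binary symmetric channel."""
    d = hamming(x, y)
    return p**d * (1.0 - p) ** (len(x) - d)


def greedy_code(n: int, p: float, eps: float):
    """Choose code points and disjoint detection regions one after another."""
    radius = n * p + sqrt(2.0 * n * p * (1.0 - p) / eps)  # np + b
    points = list(product((0, 1), repeat=n))
    used = set()            # union of the detection regions chosen so far
    code, regions = [], []
    # One pass suffices: regions only shrink as `used` grows, so a point
    # rejected once can never qualify later.
    for x in points:
        region = [y for y in points if hamming(x, y) <= radius and y not in used]
        if sum(transition_prob(x, y, p) for y in region) > 1.0 - eps:
            code.append(x)
            regions.append(region)
            used.update(region)
    return code, regions


if __name__ == "__main__":
    code, regions = greedy_code(7, 0.05, 0.2)
    print(len(code), "code points; sizes of regions:", [len(r) for r in regions])
```

The later regions are smaller than full spheres because the points already claimed by earlier regions are removed, exactly as in the construction described above.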

Let R be the union of the detection regions and $N_R$ the number of points in
R. Since R is composed of K possibly overlapping spheres each containing
$N_A$ points, $N_R \le K N_A$ or


(6) $K \ge \frac{N_R}{N_A}$.


From (5) we have a bound on $N_A$ in terms of the channel and code
parameters to continue the inequality (6). There only remains the task of finding a
bound on $N_R$.

To do so, consider the experiment of picking any sequence x at random
(i.e. each sequence has probability $2^{-n}$ of being chosen) and transmitting it
over the channel. Let y be the received sequence. We observe that


(7) $\Pr(y \in R) = \sum_{\text{all } x} \Pr(x \text{ sent}) \Pr(y \in R / x \text{ sent})$.

Now $\Pr(y \in R / x \text{ sent}) > \varepsilon/2$ for all x. It is certainly true for the code
points $X_1, X_2, \ldots, X_K$. If it were true for some other point x that
$\Pr(y \in R / x \text{ sent}) < \varepsilon/2$, then the probability that y be contained in the overlap
of A(x) and R when x is sent would also be $< \varepsilon/2$. The point x would thus
qualify as a (K+1)-st code point, contrary to assumption.

Now $\Pr(x \text{ sent}) = 2^{-n}$, so that (7) becomes


(8) $\Pr(\text{received point be in } R) > \frac{\varepsilon}{2}$.


We also have


(9) $\frac{N_R}{2^n} = \Pr(x \in R) = \sum_{x \in R,\ \text{all } y} \Pr(x \text{ sent}, y \text{ received})$.

But


$\Pr(x \text{ sent}, y \text{ received}) = \frac{1}{2^n} \Pr(y \text{ received} / x \text{ sent}) = \frac{1}{2^n} \Pr(x \text{ received} / y \text{ sent}) = \Pr(y \text{ sent}, x \text{ received})$,

since the error pattern which changes x to y is the same as the error pattern
which changes y to x. Thus from (9)


$\frac{N_R}{2^n} = \sum_{x \in R,\ \text{all } y} \Pr(y \text{ sent}, x \text{ received}) = \Pr(\text{received point be in } R)$.


Combination of this result with (8) yields


$N_R > \frac{\varepsilon}{2} \, 2^n$.


From (5) and (6) then

$K > \frac{\varepsilon}{n} \, 2^n \left( p + \frac{b}{n} \right)^{np+b} \left( q - \frac{b}{n} \right)^{nq-b}$.


On taking the log and dividing by n, one finds


(10) $\frac{\log K}{n} > 1 + p \log p + q \log q - O\!\left( \frac{1}{\sqrt{n}} \right)$,


where the terms indicated by $O(1/\sqrt{n})$ all vanish for large n at least as fast
as $1/\sqrt{n}$. The left side of (10) is the rate of transmission.

Thus by making n sufficiently large we can transmit at a rate as close to
the channel capacity C = 1 + p log p + q log q as desired with a probability of
error less than $\varepsilon$ for each code point.

This is the fundamental theorem. 
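
To see how slowly the guaranteed rate approaches capacity, the bound on K can be evaluated before the O(1/√n) terms are discarded. The sketch assumes the bound in the form K > (ε/n) 2^n (p + b/n)^(np+b) (q - b/n)^(nq-b), with b the square root of 2np(1 - p)/ε; the function names and the sample values p = 0.05, ε = 0.01 are illustrative.

```python
from math import log2, sqrt


def capacity(p: float) -> float:
    """C = 1 + p log p + q log q, in bits per channel digit."""
    return 1.0 + p * log2(p) + (1.0 - p) * log2(1.0 - p)


def rate_lower_bound(n: int, p: float, eps: float) -> float:
    """(1/n) log2 of the lower bound on the number K of code points."""
    q = 1.0 - p
    b = sqrt(2.0 * n * p * q / eps)
    u, v = p + b / n, q - b / n
    if u >= 0.5:
        raise ValueError("n is too small: the radius np + b must stay below n/2")
    return 1.0 + log2(eps / n) / n + u * log2(u) + v * log2(v)


if __name__ == "__main__":
    p, eps = 0.05, 0.01
    print(f"capacity = {capacity(p):.4f}")
    for n in (10**4, 10**6, 10**8):
        print(f"n = {n:>9}   guaranteed rate > {rate_lower_bound(n, p, eps):.4f}")
```

The gap to capacity shrinks roughly as 1/√n, so block lengths in the thousands are needed before this particular bound comes near C.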


2. — Let us consider now in further detail the matter of constructing codes for 
the binary symmetric channel. A geometric interpretation of the problem is 
helpful in understanding it. The $2^n$ n-place binary sequences can be
represented as the vertices of a unit cube in an n-dimensional Euclidean space.
The most convenient way to do so is to regard the successive digits of a binary
sequence as the successive co-ordinate values that locate the point. Fig. 6
shows the representation of the eight 3-place binary sequences as the vertices of
a cube in 3 dimensions.

The selection of a particular code for use on the channel corresponds to
specially designating $K$ of the vertices of the cube. When we transmit one of
these code points, in general the received sequence is represented by a different
vertex of the cube. If a single error occurred in transmission, the received
point lies one edge length away from the point sent. If two errors occur,
the received point lies two edges away; if $j$ errors occur it lies $j$ edges away.
Now the probability of any particular pattern of $j$ errors is $p^j q^{n-j}$, which for
$p < \tfrac{1}{2}$ is monotonic decreasing with increasing $j$. Thus the received point is
most likely to be near the transmitted point: less likely to be far away. This
comment tells us how to design the best detector for any given code. With each
code point, $X_i$, is associated a region, $R_i$, such that every point in $R_i$ is at
least as close to $X_i$ as it is to any other code point. Decoding is performed by
asserting that the transmitted message was $X_i$ whenever the received point lies
in $R_i$. It is not difficult to show that this choice of the detection region $R_i$
leads to minimum probability of error for the given code. Such a decoding scheme
is called «a maximum likelihood detector».

Fig. 6.
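
The decoding rule just described can be sketched as a nearest-code-point search.
This is only an illustration: the names `distance` and `decode` are ours, code
points are written as strings of 0's and 1's, and ties between equally close code
points are broken arbitrarily (by list order), a choice the text leaves open.

```python
def distance(u: str, v: str) -> int:
    """Number of places in which two equal-length binary sequences differ."""
    return sum(a != b for a, b in zip(u, v))


def decode(received: str, code: list[str]) -> str:
    """Return a code point at least as close to `received` as any other."""
    return min(code, key=lambda c: distance(received, c))


if __name__ == "__main__":
    code = ["000", "111"]          # illustrative two-point code, not from the text
    print(decode("110", code))     # one error away from 111
```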

In principle, then, the design of the decoding dictionary is clear. The
problem, then, is how to designate the special vertices that are to serve as the
code. The number of these points is determined by the rate at which we wish
to transmit. Their disposition on the cube will determine the error probability
to be achieved with the code.

From the geometrical picture, it is easy to see that in some not too precise
sense the points of a good code are as far apart from each other as possible.
If each of the regions $R_i$ contains the sphere of radius $e$ about $X_i$, then clearly
the code will correct all single, double, ..., e-tuple errors that may occur in
the transmission of any code point. Such a code is called an e-error-correcting
code.

How many code points can an e-error-correcting code on the n-cube have?
The answer is not known. Many upper and lower bounds have been given
in the literature, however. We mention only one,

The right side of this inequality is particularly easy to establish. Each sphere
of radius $e$ contains $\sum_{j=0}^{e} \binom{n}{j}$ points. The $K$ spheres may not exhaust all
points of the cube, so that $K \sum_{j=0}^{e} \binom{n}{j} \le 2^n$.
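
The counting argument above gives an upper bound on K that is easy to evaluate.
The sketch below is illustrative only; the function names are ours.

```python
from math import comb


def sphere_size(n: int, e: int) -> int:
    """Number of vertices of the n-cube within distance e of a given vertex."""
    return sum(comb(n, j) for j in range(e + 1))


def sphere_bound(n: int, e: int) -> int:
    """Largest integer K satisfying K * sphere_size(n, e) <= 2**n."""
    return 2**n // sphere_size(n, e)


if __name__ == "__main__":
    for n, e in ((5, 1), (7, 1), (23, 3)):
        print(n, e, sphere_size(n, e), sphere_bound(n, e))
```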

It may happen for certain values of n and e that it is possible to find a
set of K disjoint spheres of radius e that do exhaust the cube. Such codes are
called «close packed e-error correcting codes». They have some very desirable
properties. Unfortunately, there are not many such codes. A necessary
condition for the existence of such a code is clearly that $\sum_{j=0}^{e} \binom{n}{j}$ be a
power of 2. For example, if e = 1, we require $\binom{n}{0} + \binom{n}{1} = 1 + n = 2^t$, say, or
$n = 2^t - 1$. This necessary condition turns out to be sufficient in this case,
and for $n = 2^t - 1$, t = 1, 2, ... there do exist close packed single error
correcting codes. These are the well known Hamming codes.

For e = 2 one finds $\sum_{j=0}^{2} \binom{n}{j}$ a power of two for n = 2, 5 and 90. It has
been shown by H. S. SHAPIRO that these are the only values of n for which this
is true. For n = 2 and 5 quite trivial codes result. LLOYD has shown that
for n = 90, no close packed 2-error correcting code exists.
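
The necessary condition can be checked by direct search. The sketch below lists
the block lengths n (with n at least e) up to a chosen limit for which the sphere
size is a power of 2; the function names and the search limit are our own choices.

```python
from math import comb


def is_power_of_two(m: int) -> bool:
    return m > 0 and (m & (m - 1)) == 0


def candidate_lengths(e: int, n_max: int) -> list[int]:
    """Values e <= n <= n_max for which sum_{j<=e} C(n, j) is a power of 2."""
    return [n for n in range(e, n_max + 1)
            if is_power_of_two(sum(comb(n, j) for j in range(e + 1)))]


if __name__ == "__main__":
    print(candidate_lengths(1, 40))      # lengths of the form 2**t - 1
    print(candidate_lengths(2, 10000))
```

For e = 2 the search up to n = 10000 returns only 2, 5 and 90, in agreement with
the values quoted above.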

And so it goes. One can investigate the close packed codes one by one and
obtain interesting number theoretic properties, but as a class of codes they
do not appear to be too useful. One general theorem due to H. S. SHAPIRO
is worth noting before leaving this subject. For any fixed e > 2, there are
only finitely many values of n for which $\sum_{j=0}^{e} \binom{n}{j}$ is a power of 2.

Let us turn now to another class of binary codes which has received some 
attention. These are called group codes or parity check codes. Before des- 
cribing them, let me remind you briefly of the mathematical notion of a 
group. 

Let I, A, B, C, ... be a distinguishable set of objects (called elements).
Let there be given further a law which associates one of the objects with each
ordered pair of the objects. This rule is usually written as multiplication, so
that, for example, if C is associated with the pair A, B in that order one writes
AB = C. If the collection of objects and the rule have the following properties,
they are called a group:

1) There is a unique element I such that for every element A of the
collection IA = AI = A.

2) For every element A of the collection there is a unique element called
the inverse of A and denoted $A^{-1}$ such that $A A^{-1} = A^{-1} A = I$.

3) For all elements A, B, C
A(BC) = (AB)C.


These are not a minimal set of postulates for a group. Some of the statements
are derivable from others. Typical examples of groups are: 1) the set of all
positive and negative integers and zero (here the law of association is ordinarily
addition); 2) the set of all non-singular n × n matrices (here the group
multiplication is ordinary matrix multiplication).

We note that the set of all n-place binary sequences form a group when
the group multiplication law is addition modulo 2 of the sequences term by
term. Thus if A = 10111, B = 01101, AB means 10111 + 01101 = 11010.
The identity element of the group is the all zero sequence. Each element is
its own inverse. We denote the group by $B_n$.

A group code is any set of n-place binary sequences that form a subgroup
of $B_n$, i.e., that are also a group under the same law of multiplication. For
example, 0000, 1001, 0111, 1110 is a group code. The modulo two sum of any
two of these sequences is again in the collection of sequences.
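
The closure property is simple to verify mechanically. In the sketch below,
sequences are strings of 0's and 1's and the names `add_mod2` and `is_group_code`
are ours. Since each element is its own inverse, a non-empty set closed under
term-by-term addition modulo 2 automatically contains the identity and all
inverses.

```python
def add_mod2(a: str, b: str) -> str:
    """Term-by-term sum modulo 2 of two equal-length binary sequences."""
    return "".join("1" if x != y else "0" for x, y in zip(a, b))


def is_group_code(code) -> bool:
    """True if the non-empty set of sequences is closed under add_mod2."""
    words = set(code)
    return bool(words) and all(add_mod2(a, b) in words
                               for a in words for b in words)


if __name__ == "__main__":
    print(add_mod2("10111", "01101"))
    print(is_group_code(["0000", "1001", "0111", "1110"]))
```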

Group codes have many special properties of interest. I can describe only 
a few of them in the short time remaining. 

In the first place, the maximum likelihood regions $R_i$ can be described
in a simple manner for these codes. Let us list the elements of the code in a
row (first row of Fig. 7). Here I stands for the all zero sequence, and each
of the symbols $A_2, A_3, \ldots$ stands for a particular n-place binary sequence. The
elements of this row form a group, so that the product of any two A's is again
an A. Let us now define the weight of a binary sequence to be the number of 1's
in the sequence. The first row of Fig. 7 does not exhaust $B_n$. Of the elements
in $B_n$ not in the code, choose one of minimal weight and call it $S_2$. Form the
second row of the table as indicated in Fig. 7.

$$\begin{array}{lllll}
I & A_2 & A_3 & \cdots & A_\mu \\
S_2 & S_2A_2 & S_2A_3 & \cdots & S_2A_\mu \\
S_3 & S_3A_2 & S_3A_3 & \cdots & S_3A_\mu \\
\vdots & \vdots & \vdots & & \vdots \\
S_\nu & S_\nu A_2 & S_\nu A_3 & \cdots & S_\nu A_\mu
\end{array}$$

Fig. 7.

All of the elements in these two rows are distinct. (For $S_2A_i = S_2A_j$ implies,
on multiplying by $S_2$, $A_i = A_j$. Also $S_2A_i = A_j$ implies, on multiplying by
$A_i$, $S_2 = A_iA_j$. But $S_2$ was assumed not in the first row.) Now of the elements
in $B_n$ not in the first two rows, choose one of minimal weight, call it
$S_3$ and form the third row. Again it is easily seen that the elements listed are
all distinct. We continue in this way until $B_n$ is exhausted. I assert that
the columns of the array thus constructed are the maximum likelihood regions
$R_i$. That is, the region associated with $A_i$ is $A_i, S_2A_i, S_3A_i, \ldots, S_\nu A_i$.

To prove this assertion, we must show that any element in the i-th column
is at least as close to $A_i$ as it is to any other A. What we have given is that
each $S_i$ is of minimal weight in its row. We need therefore first to connect
weight with distance. Let d(R, S) denote the distance between elements R
and S; let w(R) denote the weight of element R. Then clearly

$d(R, S) = w(RS).$

From this it follows that

(1)  $d(R, S) = d(RT, ST)$

for any T, since

$d(R, S) = w(RS), \quad d(RT, ST) = w(RTST) = w(RST^2) = w(RS),$

since $T^2 = I$. Also we note

(2)  $w(R) = d(R, I).$

By construction of the array of Fig. 7, $w(S_i) \le w(S_iA_j)$ or, from (2),

$d(S_i, I) \le d(S_iA_j, I)$  all i and j.

By using (1), this becomes

$d(S_iA_m, A_m) \le d(S_iA_jA_jA_m, A_jA_m) = d(S_iA_m, A_l)$  all i, j, m,

where we have set $A_jA_m = A_l$. For a fixed m, as j takes all values, so does l,
so we have

$d(S_iA_m, A_m) \le d(S_iA_m, A_l)$  all i, l, m.

But this asserts that every element in column m is at least as close to $A_m$ as
to any other A. Q.E.D.
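
The construction of the array and the property just proved can be checked on a
small case. The sketch below represents n-place sequences as Python integers (bit
masks), so that the group product is the `^` operator; this representation, the
function names, and the rule for choosing among leaders of equal weight (smallest
integer first) are assumptions of the sketch, not part of the text.

```python
def weight(v: int) -> int:
    """Number of 1's in the binary sequence represented by v."""
    return bin(v).count("1")


def standard_array(code: list[int], n: int) -> list[list[int]]:
    """Rows S*A for successive minimal-weight leaders S; code[0] must be 0."""
    rows = [list(code)]
    used = set(code)
    # Scanning all sequences in order of weight picks, at each step, an
    # unused element of minimal weight as the next row leader.
    for s in sorted(range(2**n), key=lambda v: (weight(v), v)):
        if s not in used:
            row = [s ^ a for a in code]
            rows.append(row)
            used.update(row)
    return rows


if __name__ == "__main__":
    code = [0b0000, 0b1001, 0b0111, 0b1110]
    for row in standard_array(code, 4):
        print(" ".join(format(v, "04b") for v in row))
```

Every entry of the resulting array is at least as close to the code point
heading its column as to any other code point, which is what the proof asserts.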

If w is the weight of an element of $B_n$, we associate a probability $p^w q^{n-w}$
with the element. Let $Q_1$ be the sum of the probabilities associated with the
elements of the first column of Fig. 7. Then it is easy to show that $Q_1$ is the
probability that any transmitted A be decoded correctly.
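
As a small worked case, the sketch below evaluates this sum for the four-digit
example code 0000, 1001, 0111, 1110, taking as first column the leaders 0000,
1000, 0100, 0010 shown in the first column of Fig. 8. The value p = 0.1 is an
arbitrary illustrative choice, and the function names are ours.

```python
def weight(word: str) -> int:
    return word.count("1")


def prob_correct(leaders: list[str], p: float) -> float:
    """Sum of p**w * q**(n - w) over the first-column elements, q = 1 - p."""
    q = 1.0 - p
    n = len(leaders[0])
    return sum(p ** weight(s) * q ** (n - weight(s)) for s in leaders)


# First column of the array for the example code (as read from Fig. 8).
LEADERS = ["0000", "1000", "0100", "0010"]

if __name__ == "__main__":
    print(prob_correct(LEADERS, 0.1))   # equals q**4 + 3*p*q**3 for this code
```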

It is convenient to define two group codes to be equivalent if one can be
obtained from the other by the application of a fixed permutation of the digits
to all the elements of one of the codes. E.g. 0000, 1001, 0111, 1110 is equivalent
to 0000, 0110, 1011, 1101. The second code is obtained from the first
by interchanging the first two digits and by interchanging the last two digits
in each code point.

It can be shown that every group code is equivalent to a parity check code.
The latter class of codes can be characterized as follows. Let $x_1, x_2, \ldots, x_n$
be the successive digits in a code point: each x is zero or 1. In a parity check
code, the digits of every code point satisfy a set of linear (mod 2) equations,

$x_i = \sum_{j=1}^{k} a_{ij} x_j, \quad i = k+1, \ldots, n, \quad a_{ij} = 0 \text{ or } 1,$

and all elements of $B_n$ that satisfy these relations are code points. There are
$2^k$ code points. The first k digits can have any value and are called information
digits. The remaining n − k digits are fixed linear mod 2 combinations
of the information digits. They are called check digits. The encoding of messages
using group codes is thus arithmetic in nature. No dictionary is needed.
The incoming message is broken up into blocks of length k to be used as
information digits. The n − k check digits are computed and adjoined to the
information digits to form the block of n digits to be transmitted.
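
The encoding rule can be sketched as follows for k = 2, n = 4 with the check
rules $x_3 = x_2$ and $x_4 = x_1 + x_2$ (mod 2). These rules are satisfied by the
example code 0000, 1001, 0111, 1110 quoted earlier; the table `CHECKS` and the
function names are our own illustrative layout.

```python
from itertools import product

K, N = 2, 4
# CHECKS[i] lists the (1-based) information positions j with a_ij = 1.
CHECKS = {3: (2,), 4: (1, 2)}      # x3 = x2,  x4 = x1 + x2  (mod 2)


def encode(info):
    """Adjoin the n - k check digits to a block of k information digits."""
    word = list(info)
    for i in range(K + 1, N + 1):
        word.append(sum(info[j - 1] for j in CHECKS[i]) % 2)
    return tuple(word)


def all_code_points():
    return [encode(bits) for bits in product((0, 1), repeat=K)]


if __name__ == "__main__":
    for point in all_code_points():
        print("".join(map(str, point)))
```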

A simplification also results in the decoding dictionary of a group code.
For any received sequence of n digits, form the sequence of n − k digits
$r_{k+1}, r_{k+2}, \ldots, r_n$ where

$r_i = x_i + \sum_{j=1}^{k} a_{ij} x_j, \quad i = k+1, \ldots, n \pmod 2.$

This sequence is called the parity check sequence. It can be shown that
all elements in any row of Fig. 7 have the same parity check sequence,
and that no two rows have the same parity check sequence. To decode,
then, one forms the parity check sequence for the received element T.
This identifies one of the S's. The product ST then gives the maximum
likelihood estimate of the transmitted code point.
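
A minimal sketch of this decoding procedure for the four-digit example code
follows. The check rules ($x_3 = x_2$, $x_4 = x_1 + x_2$) and the row leaders are
those of the example; the dictionary layout and function names are assumptions
made for illustration.

```python
K, N = 2, 4
CHECKS = {3: (2,), 4: (1, 2)}      # x3 = x2,  x4 = x1 + x2  (mod 2)


def parity_check_sequence(word):
    """r_i = x_i + sum_j a_ij x_j (mod 2) for i = k+1, ..., n."""
    return tuple((word[i - 1] + sum(word[j - 1] for j in CHECKS[i])) % 2
                 for i in range(K + 1, N + 1))


# Row leader S for each parity check sequence (first column of the array).
LEADERS = {
    (0, 0): (0, 0, 0, 0),
    (0, 1): (1, 0, 0, 0),
    (1, 1): (0, 1, 0, 0),
    (1, 0): (0, 0, 1, 0),
}


def decode(received):
    """Form the product S*T of the identified leader and the received word."""
    leader = LEADERS[parity_check_sequence(received)]
    return tuple(s ^ t for s, t in zip(leader, received))


if __name__ == "__main__":
    print(decode((1, 1, 1, 1)))
```

Only one table entry per row of the array has to be stored, rather than one per
received sequence.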

Many of the notions just discussed are illustrated in Fig. 8. This will be
recognized as the code used as an illustration in Sect. 1. The column at the
extreme right lists the parity check sequence for each of the rows.

    0000   1001   1011 + 1100 = 0111   1110
    
    0000   1001   0111   1110   |   00
    1000   0001   1111   0110   |   01
    0100   1101   0011   1010   |   11
    0010   1011   0101   1100   |   10

    x_3 = x_2            r_3 = x_3 + x_2
    x_4 = x_1 + x_2      r_4 = x_4 + x_1 + x_2

Fig. 8.

The problem of finding a group code with maximum $Q_1$ for given n and k
remains unsolved.


3.- There have been many other special codes investigated both for the bi- 
nary symmetric channel and for other more complex channels. Time limitations 
prevent discussion of them here. Some of the codes might be useful in certain 
practical cases, but nothing like a general theory that leads in a constructive 
manner to the results promised by the fundamental theorem has emerged. 
The problem remains one of the most challenging in information theory. 
We have been promised the existence of communication systems with certain 
highly desirable properties. We do not yet know how to find them, nor do 
we yet know what price in complexity of equipment must be paid for this 
promised accuracy of transmission. 

The theory of communication must certainly be considered incomplete
until answers to these questions have been found.

1. — Algebraic description and realization of linear sequence filters.


A linear binary sequence filter [1] is a synchronous filter whose inputs and 
outputs are ordered sequences of binary symbols (0’s and 1’s). For the ge- 
neral non-time-varying filter each digit of the filter output sequence is a 
modulo-two sum of an arbitrary selection of past output digits (Z) and of 
present and past input digits (X). The description of a sequence filter in terms 
of a delay operator, D, is a straightforward one. For example, a filter whose 
output Z is the sum of the first and third previous output digits and of the 
present, first, second, and fourth previous input digits is described by 


(1)  $Z = DZ + D^3 Z + X + DX + D^2 X + D^4 X,$


where the + symbol is used here for the modulo-two operation. That is, the 
present output is zero if an even number of selected digits have the value 
one, and is unity if an odd number have the value one. 

Since the modulo-two operation is self-inverse, the terms in (1) may be
rearranged to give

(2-a)  $D^3 Z + DZ + Z = D^4 X + D^2 X + DX + X$
or
(2-b)  $(D^3 + D + I) Z = (D^4 + D^2 + D + I) X.$

(*) Excerpts from IRE Transactions on Information Theory, Vol. IT-2,
(Sept. 1956).

The «transfer ratio» of the filter is then

(3)  $\dfrac{Z}{X} = \dfrac{D^4 + D^2 + D + I}{D^3 + D + I}.$

An efficient realization of this filter results from rearranging (1) to give:

(4-a)  $X + Z = DX + DZ + D^2 X + D^3 Z + D^4 X$
or
(4-b)  $X + Z = D\{(X + Z) + D\{X + D\{Z + DX\}\}\}.$

The corresponding filter is given in Fig. 1-a. The «inverse» filter, whose input
is Z and whose output is X, is described by Eq. (4-a, b) and has a transfer ratio

(5)  $\dfrac{X}{Z} = \dfrac{D^3 + D + I}{D^4 + D^2 + D + I}.$


Its realization is given in Fig. 1-b. Both of the filters in Fig. 1 utilize only
two kinds of elements: modulo-two adders and unit delays (single-stage
shift-registers). The «chain» realization given for both of these filters consists
of a chain of unit delays with provision made for introducing the signals X, Z or
(X + Z) between each two stages of delay. It uses just the number of delay
units necessary to remember the input or output digit most remote in the
past which is needed for proper operation of the filter (in this case the fourth
previous input), and an equal number of adders.

Fig. 1. - Chain realization of a binary sequence filter (a) and its inverse (b).


When a binary filter and its inverse are connected in cascade, one mode of
operation of the combination is that for which the transfer ratio is the identity
operator. In our example:

(6)  $\left[\dfrac{D^4 + D^2 + D + I}{D^3 + D + I}\right] \left[\dfrac{D^3 + D + I}{D^4 + D^2 + D + I}\right] = I.$

That is, the second filter unscrambles the scrambling produced by the first. 
In the error-correcting scheme proposed in this paper the use of filters and 
their inverses will be of paramount importance. 
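
The example filter and its inverse are easy to simulate digit by digit. The sketch
below implements the filter whose output is the modulo-two sum of the first and
third previous output digits and of the present, first, second, and fourth
previous input digits, together with the inverse obtained by exchanging the roles
of X and Z. It assumes that all delay units initially hold 0; the function names
are ours.

```python
def tap(seq, t, delay):
    """Digit of seq at time t - delay; digits before time 0 are taken as 0."""
    return seq[t - delay] if t - delay >= 0 else 0


def sequence_filter(x):
    """Z = DZ + D^3 Z + X + DX + D^2 X + D^4 X, all sums modulo two."""
    z = []
    for t in range(len(x)):
        z.append(tap(z, t, 1) ^ tap(z, t, 3)
                 ^ x[t] ^ tap(x, t, 1) ^ tap(x, t, 2) ^ tap(x, t, 4))
    return z


def inverse_filter(z):
    """X = DX + D^2 X + D^4 X + Z + DZ + D^3 Z, all sums modulo two."""
    x = []
    for t in range(len(z)):
        x.append(tap(x, t, 1) ^ tap(x, t, 2) ^ tap(x, t, 4)
                 ^ z[t] ^ tap(z, t, 1) ^ tap(z, t, 3))
    return x


if __name__ == "__main__":
    x = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
    z = sequence_filter(x)
    print(z)
    print(inverse_filter(z) == x)   # the cascade acts as the identity
```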


2. — Description of a filter from its impulse response characteristics. 


Later in this paper we will want to make use of the fact that there exist
finite realizations of linear sequence filters whose response to an input
«impulse» (a single digit 1 preceded and followed by infinite sequences of 0's) is
arbitrary except that it must eventually die out (become the all-0 sequence)
or ultimately become periodic. Suppose, for example, that we wish to realize
a filter whose response to an input sequence, X*, containing an impulse is
the output sequence, Z*, which ultimately becomes periodic (see Fig. 2-a).


X*:     ...000 1000000 0000000 0000000 ...
Z*:     ...000,1010100,1110100,1110100,...
Z_p*:   ...000,1110100,1110100,1110100,...
Z_t*:   ...000 0100000 0000000 0000000 ...

(a)

Fig. 2. — Steps in the synthesis of a binary filter from a specified impulse response.


Z* can always be considered to be the sum of two sequences: $Z_p^*$, the periodic
component, and $Z_t^*$, the transient component. The filter we are trying to design
may, for the moment, be considered to be made up of two sub-filters,
$f_p$ and $f_t$, which have impulse responses $Z_p^*$ and $Z_t^*$, respectively, and whose
outputs are added to give the desired response Z* (see Fig. 2-b).

The filter $f_p$ could be realized by a cascade of two other filters (see Fig. 2-c).
The first would have an impulse response which consisted of a sequence of
impulses spaced seven intervals apart (the period of the periodic response)
and continuing indefinitely. This filter would have a transfer ratio

(7)  $I + D^7 + D^{14} + D^{21} + \cdots = \dfrac{I}{D^7 + I}.$

The periodically recurring output of this filter could be used as the input to
another filter having the proper transient response (finite in length). The latter
filter has a transfer ratio which is a polynomial, $I + D + D^2 + D^4$ in this example,
whose terms correspond to the positions of the 1's in a typical cycle, contained
between commas, of the desired response.

The transient part, $Z_t^*$, of the impulse response is easy to arrange for in
our example. The proper associated filter, $f_t$, has a transfer ratio D.

The filter we are designing could then be realized with a total transfer
ratio of

(8-a)  $\dfrac{Z^*}{X^*} = \left[\dfrac{I}{D^7 + I}\right](I + D + D^2 + D^4) + D,$

which may be rewritten as

(8-b)  $\dfrac{Z^*}{X^*} = \dfrac{(I + D + D^2 + D^4) + D(D^7 + I)}{D^7 + I}$

or as

(8-c)  $\dfrac{Z^*}{X^*} = \dfrac{D^8 + D^4 + D^2 + I}{D^7 + I}.$

The numerator and denominator of the expression in Eq. (8-c) each contain
the factor $D^4 + D^2 + D + I$ (found using the Euclidean algorithm; see
reference [1]) which may be cancelled to give

(8-d)  $\dfrac{Z^*}{X^*} = \dfrac{(D^4 + D^2 + D + I)(D^4 + D^2 + D + I)}{(D^3 + D + I)(D^4 + D^2 + D + I)} = \dfrac{D^4 + D^2 + D + I}{D^3 + D + I}.$

The transfer ratio is now in its simplest form and the filter may be synthesized
as has already been done in Eqs. (4) and Fig. 1.
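
The cancellation of the common factor can be reproduced with the Euclidean
algorithm for polynomials with modulo-two coefficients. In the sketch below a
polynomial in D is stored as a Python integer whose bit i is the coefficient of
$D^i$; this encoding and the function names are our own conventions, not those of
reference [1].

```python
def gf2_divmod(a: int, b: int):
    """Quotient and remainder of polynomial a divided by b, coefficients mod 2."""
    if b == 0:
        raise ZeroDivisionError("division by the zero polynomial")
    quotient = 0
    deg_b = b.bit_length() - 1
    while a.bit_length() - 1 >= deg_b:
        shift = (a.bit_length() - 1) - deg_b
        quotient ^= 1 << shift      # add D**shift to the quotient
        a ^= b << shift             # subtract (= add) D**shift * b
    return quotient, a


def gf2_gcd(a: int, b: int) -> int:
    """Greatest common factor by the Euclidean algorithm."""
    while b:
        a, b = b, gf2_divmod(a, b)[1]
    return a


if __name__ == "__main__":
    numerator = 0b100010101      # D^8 + D^4 + D^2 + I
    denominator = 0b10000001     # D^7 + I
    g = gf2_gcd(numerator, denominator)
    print(bin(g))
    print(bin(gf2_divmod(numerator, g)[0]), bin(gf2_divmod(denominator, g)[0]))
```

The reduced numerator and denominator are $D^4 + D^2 + D + I$ and $D^3 + D + I$.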
3. — A linear single-error correcting coding scheme. 

Consider the arrangement of filters shown in Fig. 3-a. A sequence of seven
X digits is fed into a transmitter filter with transfer ratio T, resulting in a
sequence Z = (T)X which is transmitted through the noisy channel. In the
channel a noise sequence, N, is added to Z so that what arrives at the receiver
filter is

(9-a)  $Z' = Z + N.$

At the receiver a filter inverse to the transmitter filter creates from the
sequence Z' a sequence

(9-b)  $X' = (T^{-1})Z' = (T^{-1})(Z + N) = (T^{-1})[TX + N] = X + (T^{-1})N.$

If there were no noise in the channel (N = 0), X' = X. If there is noise
present then the sequence X' contains the sum of the transmitter input
sequence, X, and the response $(T^{-1})N$ of the receiver filter to the noise. If only
a single noise digit is present, the sequence X' contains X plus the impulse
response of the receiver filter superimposed thereon.

(a)  X -> Transmitter -> Z -> Channel (Noise, N) -> Z' -> Receiver -> X'

(b)  X: (1110)000    Z: 1100010    N: 0010000    Z': 1110010    X': 1100111

(c)  X:                     (1110)000
     (T^{-1})N:              0010111
     X' = X + (T^{-1})N:     1100111

(d)  Impulse response of the receiver filter with transfer ratio
     $T^{-1} = I/(D^3 + D^2 + I)$:
     ...000,1011100,1011100,1011100,...

Fig. 3. — An elementary example of the linear single-error detecting scheme.

Let us examine the coding and decoding mechanism in more detail. The
first four digits of the sequence X are information digits, and may therefore
be chosen in $2^4 = 16$ different ways. The remaining three digits are always
all zeros and are to be called here buffer digits. The composite block of seven
digits is scrambled for transmission in the channel by the first filter. The
sequence X' which results from the unscrambling action of the receiver filter
would equal X if there were no noise in the channel. The clue to this possibility
would be the existence of three zeros in the buffer positions (the last
three digits) in the sequence X'.

When a single noise digit is equal to unity (just one transmitted digit is 
changed by noise action) the received sequence, X’, may look quite different 
from X (see Fig. 3-b). In particular, the buffer positions will no longer con- 
tain all zeros, but will instead be three successive digits of the impulse res- 
ponse of the receiver filter. In our example the impulse response of that filter 
is given in Fig. 3-d, and since we have assumed the noise impulse to 
occur in the third position of the block of seven digits, we observe in 
the buffer positions the third, fourth, and fifth digits of the impulse response 
(see Fig. 3-c,d). 

It is extremely important to notice that the digits in the buffer positions 
of the sequence X' are independent of which of the sixteen possible X sequences 
is sent. This pattern of digits depends only upon the position(s) of the noise 
digits and upon the impulse response of the receiver filter. 

We have chosen the receiver filter so that the impulse response has a period
of seven digits (the length of the composite block) and so that each of the
seven possible combinations of three (the number of buffer digits and the
degree of the denominator polynomial) successive digits in the response will
be different from the others. That this is possible for a block length of
$n = 2^b - 1$ with b buffer positions follows from the fact that the maximum
possible period of the impulse response of a filter with denominator polynomial
of degree b is $2^b - 1$ [1].

By observing the three buffer positions of the sequence X', and by knowing
the form of the impulse response of the receiver filter, we can deduce where
the noise impulse occurred in the block. If we assume that only a single
noise impulse was present (the most likely situation) we can recreate the original
sequence X by adding (same as subtracting, modulo-two) the now known
sequence $(T^{-1})N$ to the sequence X'.
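
The whole procedure can be sketched for the seven-digit example with transmitter
transfer ratio $T = D^3 + D^2 + I$. The sketch assumes that each block is
processed with the delay units initially at 0 and that at most one digit of the
block is changed by noise; the lookup table `RESPONSES` and the function names
are our own constructions.

```python
N, K = 7, 4          # block length and number of information digits


def transmit(x):
    """Z = (D^3 + D^2 + I) X within one block."""
    return [x[t] ^ (x[t - 2] if t >= 2 else 0) ^ (x[t - 3] if t >= 3 else 0)
            for t in range(N)]


def receive(z):
    """X' = (T^-1) Z', i.e. x'[t] = z'[t] + x'[t-2] + x'[t-3] (mod 2)."""
    x = []
    for t in range(N):
        x.append(z[t] ^ (x[t - 2] if t >= 2 else 0) ^ (x[t - 3] if t >= 3 else 0))
    return x


# Receiver response to a single noise digit in each position, indexed by the
# three digits that the response leaves in the buffer positions.
RESPONSES = {}
for pos in range(N):
    response = receive([int(t == pos) for t in range(N)])
    RESPONSES[tuple(response[K:])] = response


def decode(z_received):
    """Recover X, assuming at most one noise digit in the block."""
    x = receive(z_received)
    buffer_digits = tuple(x[K:])
    if any(buffer_digits):
        x = [a ^ b for a, b in zip(x, RESPONSES[buffer_digits])]
    return x


if __name__ == "__main__":
    x = [1, 1, 1, 0, 0, 0, 0]
    z = transmit(x)
    z[2] ^= 1                       # noise impulse in the third position
    print(receive(z), decode(z))
```

The seven buffer patterns in `RESPONSES` are all different, which is what makes
the position of a single error identifiable.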

For our example it is interesting to examine the sixteen possible sequences,
Z, which correspond to the sixteen possible sequences, X, which might be
inserted into the transmitter filter with $T = D^3 + D^2 + I$. These are listed in
Fig. 4. The sixteen Z sequences are mutually separated by a distance of at
least three, a necessary condition for single-error correction [2].

The advantage of the linear circuit viewpoint of this paper is that instead
of concerning ourselves with the distance properties of $2^k = 2^{n-b}$ (in our
example, 16) different code message sequences Z = (T)X, we may concentrate
our attention on the impulse response of the receiver filter with transfer ratio
$T^{-1}$. It is not claimed that this latter viewpoint will ultimately be more
advantageous than the first, but only that two viewpoints are better than one.


    X             Z = (T)X

    0000 000      0000000
    0001 000      0001011
    0010 000      0010110
    0011 000      0011101
    0100 000      0101100
    0101 000      0100111
    0110 000      0111010
    0111 000      0110001
    1000 000      1011000
    1001 000      1010011
    1010 000      1001110
    1011 000      1000101
    1100 000      1110100
    1101 000      1111111
    1110 000      1100010
    1111 000      1101001

Fig. 4. — Coded sequences for single-error correction (n = 7).
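
The table of Fig. 4 and the distance property quoted in the text can be
regenerated mechanically. The sketch below again takes $T = D^3 + D^2 + I$ and a
block of four information digits followed by three buffer zeros; the function
names are ours.

```python
from itertools import combinations, product


def transmit(x):
    """Z = (D^3 + D^2 + I) X within one seven-digit block."""
    return [x[t] ^ (x[t - 2] if t >= 2 else 0) ^ (x[t - 3] if t >= 3 else 0)
            for t in range(len(x))]


def code_table():
    """Map each block of four information digits to its coded sequence Z."""
    return {info: tuple(transmit(list(info) + [0, 0, 0]))
            for info in product((0, 1), repeat=4)}


def min_distance(words):
    """Smallest number of differing places over all pairs of words."""
    return min(sum(a != b for a, b in zip(u, v))
               for u, v in combinations(words, 2))


if __name__ == "__main__":
    table = code_table()
    for info, z in table.items():
        print("".join(map(str, info)), "000 ->", "".join(map(str, z)))
    print("minimum distance:", min_distance(list(table.values())))
```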


For single-error correction in a block of length n containing b buffer positions
and k = n − b information positions we need only have a receiver filter with
an impulse response with period of length n with each b successive digits in
that response different from each other subsequence of length b. This is possible
for the case $n = 2^b - 1$ and the proper polynomial is one of degree b
which has a maximal-length «null sequence» [1] of $2^b - 1$ digits. Several
of these are listed in Fig. 5.
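
Whether a given polynomial qualifies can be tested by computing the period of
the impulse response of the filter with transfer ratio I/p(D), which is the
smallest m for which p(D) divides $D^m + I$. The sketch below assumes that this
period is what the text calls the length of the null sequence; polynomials are
stored as integers whose bit i is the coefficient of $D^i$, and the function
names are ours.

```python
def null_sequence_period(poly: int) -> int:
    """Smallest m >= 1 such that D**m leaves remainder I on division by poly."""
    degree = poly.bit_length() - 1
    if degree < 1 or (poly & 1) == 0:
        raise ValueError("need degree >= 1 and a non-zero constant term")
    remainder, period = 1, 0
    while True:
        remainder <<= 1                  # multiply by D
        if (remainder >> degree) & 1:    # reduce modulo poly
            remainder ^= poly
        period += 1
        if remainder == 1:
            return period


def is_maximal_length(poly: int) -> bool:
    b = poly.bit_length() - 1
    return null_sequence_period(poly) == 2**b - 1


if __name__ == "__main__":
    for poly in (0b1101, 0b1011, 0b10011, 0b11111):
        print(bin(poly), null_sequence_period(poly), is_maximal_length(poly))
```

For example, $D^3 + D^2 + I$ gives period 7, while $D^4 + D^3 + D^2 + D + I$ gives
period 5 and is therefore not of maximal length.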


Dis YD. +1 D+ Di+T 
D+ D°+1I D+ D+I 
D'+D*4I DI DUT 


Fig. 5. — Polynomials having maximal length null sequences.

4. — Conclusions.


Extensions of the preceding ideas apply to the situation in which more
than one error occurs in the channel, and were treated in the paper from
which this is an excerpt. The main conclusion reached in that paper was that
instead of thinking about the distance properties of $2^k$ message points in an
n-dimensional space, we may profitably think of designing a linear binary
sequence filter at the receiver whose impulse response is of such a form that,
by viewing b = n − k successive digits of it we distinguish subsequences due
to single errors, by viewing b digits of two superimposed impulse responses
we may distinguish sub-sequences due to double errors, etc.

1. — Combinational circuits.


An information-lossless transducer is roughly one for which a knowledge 
of the output sequence of symbols is sufficient for the determination of the 
corresponding input symbols. For a combinational circuit the definition is 
particularly simple since for such circuits the input symbol (or combination 
of symbols) at a given moment uniquely defines the output symbol (or combination
of symbols). Thus an information-lossless combinational circuit is one
for which no two different input combinations can produce the same output 
combination. For instance, for a combinational circuit with n input and 
n output leads upon which binary signals can appear the usual truth-table 
or table of combinations may be examined easily to determine whether or not 
the requisite one-to-one mapping is represented, and the logical equations 
expressing the output symbol values in terms of the input symbol values may 
be solved for the input symbols in terms of the output symbols.

(a)  x_1, x_2, x_3 -> [combinational logic] -> y_1, y_2, y_3

(b)  y_1 = 1 + x_1 + x_3 + x_1 x_2 + x_1 x_3
     y_2 = 1 + x_1 + x_2 + x_3
     y_3 = 1 + x_1 + x_2 + x_1 x_2 + x_2 x_3

(c)  x_1 x_2 x_3  |  y_1 y_2 y_3
      0   0   0   |   1   1   1
      0   0   1   |   0   0   1
      0   1   0   |   1   0   0
      0   1   1   |   0   1   1
      1   0   0   |   0   0   0
      1   0   1   |   0   1   0
      1   1   0   |   1   1   0
      1   1   1   |   1   0   1

Fig. 1. — Illustrating a lossless combinational circuit.

As an example consider the circuit described in Fig. 1. The equations of
Fig. 1-b, written in terms of the operations of the logical product and addition
mod-2, are seen for this case to be non-linear since they contain product terms
such as $x_1x_3$.

The rows of entries in the right-hand side of the table of combinations
of Fig. 1-c are seen to be the same as the rows of entries in the left-hand side,
although these rows have been rearranged.

In Fig. 2 are demonstrated the «inverse» table of combinations and solutions
for the x's in terms of the y's. A simple test for the solvability of the
equations in which the y's are expressed in terms of the x's is that the expressions
for $y_1$, $y_2$, $y_3$, $y_1y_2$, $y_1y_3$, and $y_2y_3$ should not contain the term $x_1x_2x_3$, but
that this term should be contained in the expansion of the expression for $y_1y_2y_3$.


(a)  y_1 y_2 y_3  |  x_1 x_2 x_3
      0   0   0   |   1   0   0
      0   0   1   |   0   0   1
      0   1   0   |   1   0   1
      0   1   1   |   0   1   1
      1   0   0   |   0   1   0
      1   0   1   |   1   1   1
      1   1   0   |   1   1   0
      1   1   1   |   0   0   0

(b)  x_1 = 1 + y_1 + y_3 + y_1 y_2
     x_2 = y_1 + y_2 y_3
     x_3 = y_2 + y_3 + y_1 y_2 + y_2 y_3

Fig. 2. — The description of a circuit inverse to that of Fig. 1.
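
The losslessness of the example circuit and the correctness of its inverse can
be confirmed by enumerating all eight input combinations. The sketch below
takes the equations as read from Fig. 1-b and Fig. 2-b (all sums modulo 2); the
function names are ours.

```python
from itertools import product


def circuit(x1, x2, x3):
    """Output equations of the example circuit (Fig. 1-b)."""
    y1 = (1 + x1 + x3 + x1 * x2 + x1 * x3) % 2
    y2 = (1 + x1 + x2 + x3) % 2
    y3 = (1 + x1 + x2 + x1 * x2 + x2 * x3) % 2
    return (y1, y2, y3)


def inverse_circuit(y1, y2, y3):
    """Solutions for the inputs in terms of the outputs (Fig. 2-b)."""
    x1 = (1 + y1 + y3 + y1 * y2) % 2
    x2 = (y1 + y2 * y3) % 2
    x3 = (y2 + y3 + y1 * y2 + y2 * y3) % 2
    return (x1, x2, x3)


def inverse_table(f):
    """Build the inverse table of f; fail if two inputs share an output."""
    table = {}
    for x in product((0, 1), repeat=3):
        y = f(*x)
        if y in table:
            raise ValueError(f"inputs {table[y]} and {x} both give {y}")
        table[y] = x
    return table


if __name__ == "__main__":
    for y, x in sorted(inverse_table(circuit).items()):
        print(y, "->", x, inverse_circuit(*y) == x)
```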


2. — Terminal description of sequential circuits. 


For sequential circuits a more detailed definition of information-losslessness
is necessary. We consider sequential circuits for which a knowledge of the
present state and the circuit input determines the next following state and the
corresponding output. We will limit ourselves to circuits with a finite number
of states, for which the outputs are associated with the transitions between
states and occur in synchronism with the corresponding input symbol which
produced them, and which have a single binary input and a single binary
output. (Extension to circuits with an infinite number of states, or with outputs
associated with states rather than transitions, or with input and output
symbols chosen from a larger alphabet will not be made here because they
would not add to our fundamental understanding.)

All such finite-state automata may be described by any method which shows
the dependence of the next state (S) and of the output (y) upon the present
state (s) and the input (x). A flow table (as in Fig. 3-a) has the advantage of
compactness and orderliness of presentation of the necessary data, and a state
diagram (as in Fig. 3-b) perhaps has the advantage of giving a better feel for
what sequences of states are possible. The correspondence between these two
forms may be illustrated by reference to the upper right entries in the flow
table and the heavily lined transition of the flow graph. Each of these is
interpreted: when the circuit is in state s, and if the input symbol is x = 1,
the resulting output symbol is y = 0 and the next state is s,. One possible
circuit which has the terminal characteristics of Fig. 3-a,b is shown
in Fig. 3-c.

Fig. 3. — A sequential circuit description and realization.

The dependence of (S) and y upon s may be displayed in the block diagram
form of Fig. 4. The general problem of synthesis (not treated here) is to
determine for an arbitrary state diagram or flow table how many feedback loops
are necessary and what specific functions should be incorporated in the
combinational logic.

Fig. 4. — General form of a sequential circuit.

3. — Definition of information quantities in sequential circuits. 


The information quantities which we are going to use here are related to 
the knowledge that an observer of the circuit has when he has a knowledge 
of the describing flow table and of the sequence of output symbols, but no 
direct knowledge of its input symbols or of its internal states (*). 


(*) These quantities are defined more precisely and illustrated more fully in
Information Conservation and Sequence Transducers, in Proceedings of the Symposium on
Information Networks, pp. 291-307, Polytechnic Institute of Brooklyn, April, 1954.

Input information is related to the output observer's expectation of a given
input symbol. If, for example, the binary input symbols are equally likely
and independent of each other the input information rate is at all times one
bit per symbol.

Output information is related to the output observer's expectation of a given
output symbol. In the circuit of Fig. 3 this observer knows that the state s,
can be followed only by transitions which yield the output y = 0. Therefore
when he knows that the state of the circuit is s, and observes that the output
is y = 0 the corresponding output information is zero.

Information is stored when, from the output observations only, it becomes 
impossible to tell exactly what the state of the circuit is. If, for example, an 
observer calculates (as he might if given the data in the paragraph above) 
that the circuit is in the state s, or state s, with equal probability, then for 
him the circuit has stored one bit of information. Note that the quantity 
of information stored in this sense may be arbitrarily large even for a circuit 
with only two states and a correspondingly simple realization if only the input 
symbols are unexpected enough to the observer. 

Information is lost when change of internal state takes place so as to
eliminate (wholly or partially) data about the past history of the circuit input.
Its measure is related to the probability that the actual input symbol sequence
was responsible for the observed output sequence rather than any of the other
possible input sequences. For example, if the output observer knows that the
initial state of our circuit is s, and then sees two zeros in succession as output
symbols then, for him, information is lost even if the final state of the circuit
is now revealed to be s,, since the corresponding input sequence could have
been either 0,1 or 1,0 and no further analysis of the output data preceding the
initial state of sj nor the output data following the final state of s, will be of
any avail in determining which input sequence actually occurred (see Fig. 5-a).
If these sequences were equally likely one bit has been lost.

(a) (b)
Fig. 5. — Illustration of conditions for information loss.
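The one-bit figure in this example can be checked numerically. The sketch below assumes that the lost information may be measured as the Shannon entropy of the observer's remaining uncertainty over the candidate input sequences; that reading, and the helper name `uncertainty_bits`, are illustrative assumptions rather than the author's definitions.

```python
import math

def uncertainty_bits(probabilities):
    """Shannon entropy, in bits, of a distribution over candidate input sequences."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Two equally likely input sequences (0,1 and 1,0) explain the same observation.
lost_equal = uncertainty_bits([0.5, 0.5])

# If one candidate is much more probable, less than one bit is lost.
lost_skewed = uncertainty_bits([0.9, 0.1])
```

With equally likely candidates the function returns exactly one bit; any unequal weighting gives less.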

It may be proved that with those definitions of information quantities the
following information conservation is valid for each step in an indefinitely
long sequence of observations:

$$I_{\text{input}} = I_{\text{output}} + I_{\text{lost}} + \Delta I_{\text{stored}} \quad (*)$$
4. — Definition of information-lossless automata. 


It is clear from the preceding discussion that information loss occurs in
a circuit when two or more input sequences may lead to the same output
sequence because then the input sequence cannot be uniquely determined if
only the output sequence is known. More exactly, a sequential circuit (even
one with an infinite number of states) is defined as lossless if and only if there
exist no two (not necessarily different) states $s_i$ and $s_j$, no two different
equal-length input sequences $\{x_a\}$ and $\{x_b\}$, and no output sequence $\{y_0\}$, such that
both $\{x_a\}$ and $\{x_b\}$ can lead from $s_i$ to $s_j$ and yield $\{y_0\}$. This is of course
equivalent to saying that a circuit is lossless if and only if, for an indefinitely long
experiment in which the initial and final states and the output sequence are
known, the input sequence may be uniquely determined.
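The definition can be applied directly, by brute force, to a small machine. The sketch below assumes a hypothetical encoding `machine[state][input] = (next_state, output)` and searches all input sequences up to a chosen length for two that share initial state, final state and output sequence. The two example machines are illustrative, not taken from the figures. Because the search is bounded, a result of `None` shows only that no violation exists up to that length.

```python
from itertools import product

# Hypothetical encoding: machine[state][input] = (next_state, output).
CLASS_II = {
    1: {0: (2, 0), 1: (3, 1)},
    2: {0: (1, 0), 1: (3, 0)},
    3: {0: (1, 1), 1: (4, 1)},
    4: {0: (2, 1), 1: (4, 0)},
}
LOSSY = {"a": {0: ("a", 0), 1: ("a", 0)}}

def run(machine, state, inputs):
    """Apply an input sequence; return the final state and the output sequence."""
    outputs = []
    for x in inputs:
        state, y = machine[state][x]
        outputs.append(y)
    return state, tuple(outputs)

def find_loss_witness(machine, max_len):
    """Return (initial, final, outputs, seq_a, seq_b) violating the definition,
    or None if no violation exists for sequences of length <= max_len."""
    symbols = sorted({x for row in machine.values() for x in row})
    for n in range(1, max_len + 1):
        for start in machine:
            seen = {}
            for seq in product(symbols, repeat=n):
                key = run(machine, start, seq)  # (final state, outputs)
                if key in seen:
                    return start, key[0], key[1], seen[key], seq
                seen[key] = seq
    return None
```

For `LOSSY` the search stops at length one, since both input symbols take state `a` to itself with output 0.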


5. — Class I information-lossless automata. 


The clerical procedures for the determination of losslessness may be orga- 
nized rather neatly. Consider the flow table of Fig. 6-a and the derived table 
of Fig. 6-b. The first row of this derived table tells us that if the initial state 
of the circuit is s, the next state may be deduced to be either s, or s; imme- 
diately upon determination of the output symbol as y= 0 or y=1, respec- 
tively. The other rows have similar interpretations. Clearly the example before 
us is a special case for which an input symbol always produces an immediate 
(mod-2) effect upon the output and is characterized by the fact that each of 


(a) Flow table. (b) Derived table.
Fig. 6. — Tabular test for losslessness applied to a Class I circuit. 


the two transitions away from any state are associated with the two different 
output symbols. Thus the possibility for « parallel » sequences shown in Fig. 5-b 
does not exist. For such circuits, which will be called Class I circuits, it is 
possible to derive inverse circuits which when put in cascade with the original 
produce as their output sequence an exact replica of the input sequence of 
the original. The terminal specifications for these inverse circuits are easily
had by completing the table illustrated in Fig. 6-b with entries (parenthesized)
telling what x-symbol should be associated with a given transition.

26 - Supplemento al Nuovo Cimento.

402 D. A. HUFFMAN

A block diagram showing one possible realization of a Class I circuit is
shown in Fig. 7-a and one possible realization of its inverse is shown in Fig. 7-b.
These two circuits differ only in the connections made to the mod-2 adder
gate, and therefore we may conclude that the inverse to a Class I circuit may
be realized in a circuit having the same number of states as the original.
Moreover, since the circuits have a reciprocal relationship either may be used as
the canonic form of a Class I circuit.

Fig. 7. — Canonic forms for a Class I circuit.
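The reciprocal relationship can be illustrated with a small simulation. The sketch assumes the Class I form in which the output is the mod-2 sum of the input and some function of the present state; the state-update rule `shift`, the state function `parity` and the factory `make_class_one` are hypothetical choices made for illustration, not the circuit of Fig. 7.

```python
def make_class_one(next_state, g):
    """Build an encoder with y = x XOR g(state) and its inverse decoder."""
    def encode(state, inputs):
        outputs = []
        for x in inputs:
            outputs.append(x ^ g(state))      # input acts mod-2 on the output
            state = next_state(state, x)
        return outputs

    def decode(state, outputs):
        recovered = []
        for y in outputs:
            x = y ^ g(state)                  # same adder, input and output swapped
            recovered.append(x)
            state = next_state(state, x)      # inverse tracks the same state
        return recovered

    return encode, decode

# Hypothetical four-state example: the state holds the last two input bits.
def shift(state, x):
    return ((state << 1) | x) & 0b11

def parity(state):
    return (state ^ (state >> 1)) & 1

encode, decode = make_class_one(shift, parity)
message = [1, 0, 1, 1, 0, 0, 1]
round_trip = decode(0, encode(0, message))
```

The decoder uses the same state-update logic and the same number of states as the encoder; only the roles of x and y at the mod-2 adder are exchanged.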


6. — Class II information-lossless automata. 


Another case of an information-lossless circuit is shown in Fig. 8. The
upper four rows of the table in Fig. 8-b are derived in a manner similar to that
used for our previous example, except that now the knowledge of an output
symbol does not lead immediately to a knowledge of the input symbol which
produced it. For example, the second row of the derived table is to be
interpreted as follows: If the initial state of the original circuit is $s_2$ then the symbol
y = 0 must necessarily follow and as a result we are now not certain whether
the following state is $s_1$ or $s_3$ (or whether the input symbol was x = 0 or x = 1).

(a) Flow table. (b) Derived table.
Fig. 8. — Tabular test for losslessness applied to a Class II circuit. 
The first four rows of the new table indicate that confusion may exist
between states $s_1$ and $s_3$ or between $s_1$ and $s_4$. The two symbols $s_{13}$ and $s_{14}$ are
entered as designators for rows which are added to the first four rows of the
table. Entries for these new rows are found by adding subscripts found in
the corresponding entries found in the rows specified by the subscripts of the
designator of the new row. For instance, the entry in the y = 1 column for
the row headed $s_{13}$ is $s_{134}$ since the entries found in the rows headed $s_1$ and $s_3$
were $s_3$ and $s_{14}$. The newly derived entry tells us that if we are uncertain
as to whether the state of the circuit is $s_1$ or $s_3$ and if an output symbol y = 1
is observed, our new uncertainty is among $s_1$, $s_3$ and $s_4$. The process of
generation of new rows is repeated as long as is necessary. Ultimately the
necessity for new rows is ended and the table is complete. If in the process of
adding subscripts from « component » rows to find the subscripts for
« composite » rows no situation is found in which the same subscript is found in
each of the component rows, the circuit being tested is information-lossless.
Our present example is one of this type.

It could have been seen directly from Fig. 8-a that the flow table described 
a lossless circuit, since two and only two transitions lead to each state and 
each of these transitions is associated with a different output symbol. We 
will call such a circuit a Class II circuit. Thus there is no possibility for 
« parallel » sequences shown in Fig. 5-b. Further, a knowledge of the final 
state of the circuit and the last output symbol is enough for the determi- 
nation of the next-to-final circuit state. Thus the input sequence for a finite 
experiment on a Class II circuit may be determined from a knowledge of the 
final state and the output sequence, just as the input sequence for a finite 
experiment on a Class I circuit may be determined from a knowledge of the
initial state and the output sequence.

Since, for a Class II circuit, knowledge of a state and the output symbol for
the transition leading to that state is sufficient for the determination of the
preceding state and this input symbol, this is equivalent to saying that the
combinational logic of the general block diagram of Fig. 4 is, for Class II
circuits, lossless in the sense of Section 1 (see Fig. 9).

Fig. 9. — Canonic form for a Class II circuit.


7. — More general information-lossless circuits. 


It seems to the author that both Class I and Class II circuits deserve to 
be called information-lossless, the first since an inverse circuit can always be 
specified and the second both because a specific decoding procedure can be 
described once the final state of an experiment is given, and because of the
conceptually satisfying result that a lossless combinational circuit in which
some outputs are reintroduced as inputs after a unit delay is also lossless in
the wider sense we have used in this paper to apply to sequential circuits.
It is only fair to point out to the reader that some other more restricted
definitions of terms similar to information-losslessness as used in this paper have
been used and probably will continue to be used by others. 

There are many circuits which are lossless which are neither purely Class I
nor purely Class II circuits. For all of these circuits the test illustrated in
Fig. 8-b is valid, but the circuit cannot be synthesized in either of the canonic
forms already given. Instead the more general canonic form shown in Fig. 10
may be shown (by specifying a rather involved synthesis procedure) to be
appropriate. It is easy to see that the canonic forms of Fig. 7-a and Fig. 9 are
special cases of the diagram in Fig. 10. The meaning of the block labelled
« lossless logic » with feedback signals (β) acting as « control » is that, for any
set of values of the β-signals, the values of α and x may be determined from a
knowledge of the values of y and (α).

Fig. 10. — General canonic form into which all information-lossless finite
automata may be synthesized.

An interesting result which was not apparent until the canonic form of
Fig. 10 was derived is that the total internal state at both the beginning and
at the end of an experiment need not be known. Instead, information about
the initial state of the feedback loops around which the β-signals flow and
about the final state of the feedback loops around which the α-signals flow
(along with a knowledge of the sequence of y-symbols) is sufficient for
determination of the sequence of input symbols. The actual decoding procedure
involves the determination of the entire sequence of β-values from the given
data about the initial β-value and the sequence of y-values, and next an
iterated determination of the sequences of α-values and x-values from the
final α-value and the now-known sequence of β-values.

Also of interest is the fact that no information is ever stored in the
right-hand portion of the circuit since knowledge about the initial value of the
β-signals and about the sequence of y-symbols gives us an always up-to-date
exact knowledge of what the signals are in the β-feedback loop(s). All the
stored information in the circuit is associated with the block marked « lossless
logic » and the adjacent feedback loop(s).


Finally, to serve a warning to those who still somehow feel that
« information » is associated with the symbols within a finite-state circuit on a sort
of every-symbol-carries-its-own-information basis we show the circuit of
Fig. 11, which has no feedback loops, but in which, nevertheless, information
necessary for the determination of some input symbol may be stored for an
arbitrarily large number of steps of an experiment. Of course, then, no finite
« inverse » exists even if we agree that the « inverse » need not regenerate the
x-sequence immediately, but only after some long but fixed delay. The details
of analysis of this circuit are left to the interested reader.


Fig. 11. — An interesting finite state circuit. 
I should like to discuss some results obtained by my colleagues and myself 
in trying to devise, by analysis of suitable statistical models, an operating 
communication system for multipath conditions. This work has already been 
described in great detail elsewhere (PRICE and GREEN [1]); here I shall sum- 
marize it for those to whom these ideas are new. 

The problem is somehow to devise a system for sending binary information 
through a channel perturbed by randomly varying multipath and additive 
gaussian noise. By multipath we shall mean a condition in which there is 
more than one propagation path from transmitter to receiver, and we shall 
begin by postulating a suitable model for this often troublesome phenomenon. 
For the practical cases of interest we can say that 


1) The maximum difference in the travel times of the transmitted signal
along the different propagation paths is some number $T_M$. The
strengths of paths outside this range $T_M$ are to be considered negligible.


2) The times-of-flight and strengths of the individual paths are random
variables with some upper limit R on fluctuation rate to be defined
shortly.


3) The propagating medium is linear.


The particular communication environment that has been of interest in
this work has been the so-called high frequency band (3 to 30 MHz) in which
multipath has always been a severe problem. We are interested in sending
binary information where a duration $T_p$ is allotted to the transmission of each
successive binary digit.


We will be interested in situations in which

$$T_M < T_p \ll 1/R \tag{1}$$

and the results described here will be applicable to any such situation. In
the high frequency problem $T_M$ is usually less than five milliseconds, $T_p$ is
several tens of milliseconds and R is of the order of 3 to 3 Hz.

APPLICATION OF STATISTICAL NOTIONS TO MULTIPATH CHANNELS 407

The three assumptions just given allow one to define a time-varying
linear filter as a model for the propagating medium (Fig. 1-A). The
response to a unit impulse is $h(\tau)$, typically but not necessarily
representable as a series of impulses as in Fig. 1-B, changing slowly with real
time t. Fig. 1-B shows $h(\tau)$ for a particular value of t. The
time-varying complex frequency response $H(\omega)$ at a time t is defined as the
Fourier transform of that function $h(\tau)$ occurring at time t. Under a fourth
assumption, namely that

4) The communication is confined to a frequency band W cycles per
second in width with W smaller than the center frequency of the band,

the equivalent filter of Fig. 1-A can be redrawn in the form shown in
Fig. 1-C. This is done by using a form of the sampling theorem stating that the
function $h'(\tau)$, the inverse Fourier transform of that portion of $H(\omega)$ lying in W,
can be represented uniquely in terms of $2 T_M W$ suitable samples. The particular
samples chosen are values of amplitude and phase of $h'(\tau)$ taken at values of
delay spaced 1/W second apart. These amplitudes and phases $\alpha_i$ and $\varphi_i$ are slowly
varying functions of time t and are represented as the gains and phase shifts
of amplifiers whose inputs are the outputs of a tapped delay line, and whose
outputs are all added together. We arbitrarily define the parameter R as
the reciprocal of the smallest time during which any $\alpha_i$ or $\varphi_i$ changes by say
10 percent or 10 degrees respectively.

Fig. 1. [Panels (A), (B), (C); legible labels: MULTIPATH, $H(\omega, t)$, DELAY LINE, $n = T_M W$.]
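The tapped-delay-line model can be sketched in a few lines. The sketch assumes, as a simplification not stated in the text, that signals are handled as complex baseband samples spaced 1/W apart, so that each amplifier's gain and phase shift become one complex multiplier. The function names are hypothetical, and the taps are held fixed, which corresponds to an observation much shorter than 1/R.

```python
import cmath

def number_of_taps(t_m, w):
    """Delay-line length n = T_M * W (spread in seconds, bandwidth in Hz)."""
    return round(t_m * w)

def tapped_delay_line(samples, amplitudes, phases):
    """Sum of delayed replicas, each weighted by amplitude * exp(j * phase)."""
    taps = [a * cmath.exp(1j * p) for a, p in zip(amplitudes, phases)]
    output = []
    for k in range(len(samples)):
        acc = 0j
        for i, tap in enumerate(taps):
            if k - i >= 0:                 # replica delayed by i sample periods
                acc += tap * samples[k - i]
        output.append(acc)
    return output

# Response to a unit impulse: the taps themselves appear at successive delays.
impulse_response = tapped_delay_line([1, 0, 0], [1.0, 0.5], [0.0, cmath.pi])

# Field-test values quoted later in the text: T_M = 3 ms, W = 10 kHz.
n = number_of_taps(3e-3, 10e3)
```

With the field-test values the delay line needs 30 taps, each carrying one amplitude and one phase, which accounts for the $2 T_M W$ samples mentioned above.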

Once we have defined what we mean by a multipath channel, and have 
set down a model to represent it, it is possible to proceed in several directions 
in deriving an optimum communication system to work through it. The deri- 
vation to be presented now leads to a form of optimum receiver which has 
been called a «rake» receiver. One can derive the same configuration in 
different ways. However, in each derivation that we have been able to devise 
there is at least one step that must be made in a somewhat heuristic way. 
No one analysis is completely self-contained. 
To deduce the form of optimum receiver for both additive noise and
multipath present it is convenient to start with the well-known optimum receiver for
additive noise alone, and then generalize it to include multipath as well.
Fig. 2 depicts this simpler situation. At the left is a transmitter sending
one of two waveforms $x_0(\tau)$ or $x_1(\tau)$ to represent symbols 0 or 1. Each
waveform $x_0(\tau)$ and $x_1(\tau)$ is assumed to have negligible energy outside a time
interval $T_p$ seconds long. A succession of these signals is sent and in the
channel is corrupted by the addition of a stationary noise function $n(\tau)$ having a
gaussian amplitude distribution and a constant spectral power density $N_0$ W/Hz
over the frequency region of interest. The optimum receiver is defined as one
which will deliver at its output a succession of 0 or 1 symbols which differ from
the input sequence as infrequently as possible.

Fig. 2. [Legible labels: TRANSMITTER, CHANNEL, NOISE $n(\tau)$, RECEIVER.]

It can be shown that, having observed the received signal

$$y(\tau) = \begin{cases} x_0(\tau) + n(\tau) \\ x_1(\tau) + n(\tau) \end{cases} \tag{2}$$

but not knowing which x was added to n, the a posteriori probability that,
given $y(\tau)$, a zero was transmitted is

$$P(0 \mid y(\tau)) = F_0\!\left( \int_0^{T_p} [x_0(\tau) - y(\tau)]^2 \, d\tau \right) \tag{3}$$

and similarly

$$P(1 \mid y(\tau)) = F_1\!\left( \int_0^{T_p} [x_1(\tau) - y(\tau)]^2 \, d\tau \right) \tag{4}$$
where $F_0$ and $F_1$ are monotonic decreasing functions of their arguments. Errors
will be as infrequent as it is possible to make them if these two probabilities
can be computed and a decision made in favor of 0 if $P(0 \mid y(\tau)) > P(1 \mid y(\tau))$
and in favor of 1 if vice versa. $F_0$ and $F_1$ depend on the a priori probabilities
P(0) and P(1) and if these and the waveforms $x_0(\tau)$ and $x_1(\tau)$ are known at
the receiving end of the system, all the data necessary to compute $P(0 \mid y(\tau))$
and $P(1 \mid y(\tau))$ is at hand, and an optimum receiver can be constructed. The
receiver shown in Fig. 2 is such a receiver for the common case in which

$$P(0) = P(1) = \tfrac{1}{2} \tag{5}$$

and the transmitter signal energies are equal

$$\int_0^{T_p} x_0^2(\tau) \, d\tau = \int_0^{T_p} x_1^2(\tau) \, d\tau = E_t . \tag{6}$$

Condition (5) means that $F_0 = F_1$ and because this function is monotonic,
a decision can be made by comparing just the arguments in equations (3)
and (4). By expanding the integrands of these arguments and using (6) it
is seen that a decision should be made according to which cross-correlation

$$\int_0^{T_p} x_0(\tau) \, y(\tau) \, d\tau \quad \text{or} \quad \int_0^{T_p} x_1(\tau) \, y(\tau) \, d\tau$$

is the larger, rather than according to which mean-square difference
(equations (3) and (4)) is the smaller.
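The equivalence of the two decision rules follows from expanding the square: the integral of $[x_k - y]^2$ equals the signal energy plus the energy of y minus twice the cross-correlation of $x_k$ with y, and under the equal-energy condition (6) only the last term depends on k. The sketch below checks this on sampled waveforms. The waveforms `X0` and `X1`, the integer-valued disturbance (used so that the arithmetic is exact) and the function names are illustrative assumptions.

```python
import random

def correlate(a, b):
    """Discrete stand-in for the cross-correlation integral over one digit."""
    return sum(p * q for p, q in zip(a, b))

def squared_difference(a, b):
    """Discrete stand-in for the mean-square-difference integral."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def decide_by_correlation(y, x0, x1):
    return 0 if correlate(x0, y) > correlate(x1, y) else 1

def decide_by_distance(y, x0, x1):
    return 0 if squared_difference(x0, y) < squared_difference(x1, y) else 1

# Two equal-energy sampled waveforms (hypothetical).
X0 = [1, 1, -1, -1]
X1 = [1, -1, 1, -1]

rng = random.Random(0)
trials = []
for _ in range(200):
    sent = rng.choice((0, 1))
    y = [s + rng.randint(-3, 3) for s in (X0 if sent == 0 else X1)]
    trials.append((decide_by_correlation(y, X0, X1),
                   decide_by_distance(y, X0, X1)))
```

Every pair in `trials` holds two identical decisions. If the energies of the two waveforms differ, the rules no longer coincide.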

Having found the optimum receiver we might be tempted to go further
and discover some optimum choice of transmitter waveforms $x_0(\tau)$ and $x_1(\tau)$,
under an average power constraint $E_t$ = a constant. However, it turns out
that an optimum choice for $x_0(\tau)$ and $x_1(\tau)$ is merely that they be equal and
opposite. As long as they obey this condition and have energy $E_t$, the
particular choice of waveform does not matter. The probability of error, as it
turns out, is a function only of $E_t/N_0$.
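The dependence on the energy-to-noise ratio alone can be illustrated with the standard textbook result for equal and opposite signals in white gaussian noise, $P_e = Q(\sqrt{2E/N_0})$. That formula is not given in the text and is supplied here as an assumption, with $N_0$ taken as a one-sided spectral density; the function names are hypothetical.

```python
import math

def q_function(z):
    """Probability that a unit-variance gaussian variable exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def antipodal_error_probability(energy, n0):
    """Standard result for equal-and-opposite signals; n0 assumed one-sided."""
    return q_function(math.sqrt(2.0 * energy / n0))

# Only the ratio matters: doubling both energy and noise density changes nothing.
p_a = antipodal_error_probability(1.0, 2.0)
p_b = antipodal_error_probability(2.0, 4.0)
```

The waveform shape does not enter the formula at all, which is the point made in the paragraph above.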

Now let us proceed to introduce the multipath condition and see what 
can be said about the optimum receiver and the best choice of signals to use 
with it. 

The multipath condition, representable by the network of Fig. 1-C, is
assumed to be in cascade with the transmitted signal before the noise $n(\tau)$
is added, i.e. the filter appears at point P in Fig. 2. The receiver is now no
longer optimum since it is correlating $x_0(\tau)$ and $x_1(\tau)$ against

$$y(\tau) = \begin{cases} x_0(\tau) * h(\tau, t) + n(\tau) \\ x_1(\tau) * h(\tau, t) + n(\tau) \end{cases} \tag{7}$$

rather than the $y(\tau)$ given by equation (2). (The $*$ indicates the convolution
operation describing the output of a linear filter.) But notice that if the
reference signals at the receiver were not $x_0(\tau)$ and $x_1(\tau)$, but rather $x_0(\tau) * h(\tau, t)$
and $x_1(\tau) * h(\tau, t)$ respectively, by the arguments given previously, the
receiver would be optimum again. And this can clearly be done by inserting
filters, identical to that representing the multipath channel, in cascade
with the sources of signals $x_0(\tau)$ and $x_1(\tau)$, as shown in Fig. 3. As the
succession of signals lasting $T_p$ are transmitted, and the multipath condition
(represented by all the $\alpha_i$'s and $\varphi_i$'s) changes slowly, the receiver will still be
optimum as long as the $\alpha_i$'s and $\varphi_i$'s in the two filters of the receiver are kept
in correspondence with those in the channel. (This correspondence is indicated
by broken lines in Fig. 3.)

Fig. 3. [Legible labels: TRANSMITTER, MULTIPATH, NOISE, CHANNEL, RECEIVER.]

Now the question of the optimum transmitter signals is not so easily
answerable. The probability of error, as before, depends on the ratio of signal
energy $E_s$ to noise spectral density $N_0$, where by signal energy we mean the
signal at the point where noise is added to it:

$$E_s = \int_0^{T_p} [x_0(\tau) * h(\tau, t)]^2 \, d\tau , \tag{8}$$

and similarly for $x_1(\tau)$. But in a physical system we will want to constrain
the transmitter power as indicated in equation (6). To minimize the
probability of error under this constraint but with the multipath condition
present, it can be easily shown from (8) that $x_0(\tau)$ and $x_1(\tau)$ should be sinewaves
of opposite phase at the frequency in W for which $|H(\omega, t)|^2$ is a maximum.
For present purposes, however, this solution must be considered to be of
academic interest only, since it implies some return link from receiver to
transmitter so that the proper waveforms may be mutually agreed upon.
Communication systems with such feedback links are interesting but here we will
be obliged to confine ourselves arbitrarily to the condition for which the
waveforms $x_0(\tau)$ and $x_1(\tau)$, once agreed upon, are not altered. The question of
optimum signal waveforms under this condition has not been solved, and as
will be seen from subsequent paragraphs the particular choice used by us has
been based more or less on intuitive reasoning, and represents, at best, a start
on the problem.

Returning to Fig. 3, we perceive that in order for the correspondences
indicated by dotted lines to be maintained, the receiver must somehow make
measurements of the $\alpha_i$'s and $\varphi_i$'s and use them in the correcting filters. Before
discussing the measurement function we redraw the receiver of Fig. 3 in the
form given in Fig. 4. To do this we note that in Fig. 3 the input to each
integrator is obtained by multiplying $y(\tau)$ by the sum of delayed replicas of
$x(\tau)$, each replica having been multiplied by an amplitude and phase angle.
By doing the multiplication ahead of the weighting and developing variously
delayed replicas of $y(\tau)$ rather than of $x(\tau)$ we have the scheme of Fig. 4.
That this is legitimate and also has certain practical advantages is shown in
the parent paper ([1], p. 561).


Fig. 4. [Legible labels: RECEIVER INPUT, DECISION, OUTPUT.]

The measurement of $h(\tau, t)$, that is the $\alpha_i$'s and $\varphi_i$'s, can obviously not be
done perfectly, because of the noise. However, remembering that the
fluctuation period 1/R of these parameters is much longer than $T_p$ (expression (1)),
the length of each signalling element, we are tempted to measure the $\alpha_i$ and $\varphi_i$
as accurately as we can by observing for 1/R seconds, knowing that each
measurement will be less noisy than the two inputs to the decision operation,
since the latter are allowed an observation interval of only $T_p$.

There are several different ways in which one can measure the impulsive
response of a filter. A particularly appropriate one, shown in Fig. 5-A, is the
well-known artifice of cross-correlating input and output. For non-time varying
filters

$$H(\omega) = \Phi_{io}(\omega) / \Phi_{ii}(\omega) , \tag{9}$$

where $\Phi_{io}(\omega)$ and $\Phi_{ii}(\omega)$ are the Fourier transforms of the input-output
cross-correlation function, and input correlation function, respectively. If
$\Phi_{ii}(\omega)$ is a constant, we have

$$h(\tau) = K \varphi_{io}(\tau) . \tag{10}$$

As long as $h(\tau, t)$ and thereby $H(\omega, t)$ are changing slowly enough (see
expression (1)) the same method is applicable and we can write approximately

$$h(\tau, t) = K \varphi_{io}(\tau, t) . \tag{11}$$

Fig. 5.

Since we are only interested in $h'(\tau, t)$, the inverse Fourier transform of that
part of $H(\omega, t)$ lying in W, we can assume the input to have a non-zero
spectrum only in W. Thus we can use the device of Fig. 5-B to cross-correlate
input and output $f_i$ and $f_o$ and deliver sample values $\alpha_i$ and $\varphi_i$ which define
uniquely the band-limited function $h'(\tau, t)$. Each integrator provides as its
output the integral of the past 1/R seconds of its input. A careful distinction
should be made at this point between Figs. 1-C and 5-B. The former is an
equivalent representation of a filter having a band-limited impulse response.
The latter is a device to measure such a response by cross-correlating the
output and input of such a filter.
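The cross-correlation measurement can be tried on a small discrete example. The sketch makes several simplifying assumptions that are not in the text: real-valued taps instead of amplitude and phase pairs, a random binary probe of unit power as the flat-spectrum input, no additive noise, and a fixed filter. With a unit-power probe the constant K is one. All names are hypothetical.

```python
import random

def fir_filter(x, h):
    """Output of a fixed tapped delay line with real tap weights h."""
    return [sum(h[i] * x[k - i] for i in range(len(h)) if k - i >= 0)
            for k in range(len(x))]

def estimate_taps(x, y, n_taps):
    """Input-output cross-correlation at delays 0 .. n_taps - 1,
    averaged over the whole record."""
    n = len(x)
    return [sum(y[k] * x[k - i] for k in range(i, n)) / n
            for i in range(n_taps)]

rng = random.Random(1)
probe = [rng.choice((-1.0, 1.0)) for _ in range(4000)]   # flat-spectrum input
true_taps = [0.8, 0.0, -0.4, 0.2]
received = fir_filter(probe, true_taps)
estimate = estimate_taps(probe, received, len(true_taps))
```

The estimates approach the true taps as the averaging record grows, which mirrors the argument for integrating over 1/R seconds rather than over a single digit.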

In our communication system the receiver is in possession of the filter
output $f_o$, albeit in a noisy form. All that can be said about the input is that
it is either $x_0(\tau)$ or $x_1(\tau)$. We can still make the measurement if we use as $f_i$
in Fig. 5-B the mixture $x_0(\tau) + x_1(\tau)$ and insure that $x_0(\tau)$ and $x_1(\tau)$ are
reasonably orthogonal for the integration time 1/R. Then the measurement outputs
$\alpha_i$ and $\varphi_i$ will be only slightly more noisy than if the actual input sequence
were known.
And now as a final step in deriving the optimum receiver we notice by
comparing Figs. 4 and 5-B that we can use the delay line and multipliers of
the former to get the voltages which when integrated for 1/R seconds give
the approximate values of the $\alpha_i$ and $\varphi_i$ shown in Fig. 5-B. This is done in
Fig. 6 which shows the output of each upper multiplier being added to that
of the lower to form the mixture feeding the 1/R-second integrator. The
output of each such integrator is applied as the $\alpha_i$ and $\varphi_i$ correction to the output
of the first multiplier.

To recapitulate: we have taken the known result for an optimum receiver
in the presence of white gaussian noise and modified it to include the case in
which there is also some known multipath disturbance in cascade in the channel.
Then we have argued, somewhat intuitively, that if the cascaded filter is changing
slowly we can make a reasonably error-free measurement of it at the receiver
and apply this knowledge for continuous and automatic readjustment of the
filter used in the receiver. The name « rake » derives from the manner in which
the various correlation detectors are arranged equidistantly along an axis of
delays so as to detect any signal arriving in the range of delays $T_M$.

Recall that we used as a priori information about the multipath condition
only that the spread in path delays was smaller than some $T_M$, that the
medium was linear and that in a sampled-type representation of the equivalent
linear filter, $\alpha_i$ and $\varphi_i$ varied more slowly than some rate $R \ll 1/T_p$. This
has allowed a reasonably simple explanation of the rake receiver and
its statistically optimum properties.

This form of receiver was originally derived by PRICE [2] in a different
way. He used a considerably more sophisticated a priori picture of the
propagating medium (specifically the delays, the probability distributions of
amplitudes and phases of each of the paths and the power spectrum of their time
variation). From this (for large N) an optimum receiver somewhat like that in
Fig. 6 was derived more or less directly, with the measurement operation
already contained in the result, and not introduced ad hoc as in the treatment
given here. The principal heuristic extension necessary in Price’s result was to
assume that it still holds for reasonable values of N, and to provide taps on the
receiver delay line spaced by the sampling interval (1/W), rather than to leave
them placed at the mean delay of each of the assumed discrete paths.

Fig. 6. [Legible labels: INPUT, OUTPUT.]

The previous paragraph is by way of historical perspective and points out 
that we have been unable to construct a unified derivation for the rake re- 
ceiver that has not required some supplemental reasoning of a heuristic nature. 

The question of optimum wave shapes for $x_0(\tau)$ and $x_1(\tau)$ is worth a few more
remarks. It has been noted that the two should be roughly orthogonal for
the integration time 1/R in order to permit the measurement function and it
will be recognized that they should also be orthogonal over the integration
time $T_p$ as well, so as to minimize the probability of error. We have referred
repeatedly to a bandwidth W without stating what it should be, and we have
stated that the spectrum should be flat within W. Using known results on
the signal-to-noise ratio observed at the output of correlation detectors ([3],
equation (13)) it is possible to derive the following approximate expression
for signal-to-noise ratio at the decision element input of the receiver of Fig. 6
as a function of the parameters $\alpha_i$ defining the multipath condition

(12) [expression not legible in the source]


This expression assumes that the signals are segments of gaussian noise of 
flat density in W. The first denominator term shows the effect of the additive 
channel noise n(t) whereas the second is a self-noise term which can be re- 
duced by choosing the waveform statistics to be something other than gaussian. 
It is not known exactly how large a reduction in this term is theoretically 
achievable, nor what waveforms to use in achieving it. 

From equation (12) it is seen that the proper choice of W is to make it
as large as possible. When the first denominator term is smaller than the second
(large receiver input signal-to-noise ratio), (S/N) is proportional to the
coefficient $T_p W$. When the first term is greater than the second (small input
signal-to-noise ratio) an increase in W adds more terms to the numerator
summation. This continues until all paths have been resolved whereupon a further
increase of W no longer helps. There is experimental evidence that in the
high frequency band this condition is not achieved for bandwidths less than
50 to 100 kHz, and one does not usually have this much bandwidth at his
disposal.

A system employing the rake receiver has been built and subjected to
limited field tests, using W = 10 kHz, $T_M$ = 3 ms, $T_p$ = 22 ms and R (in Fig. 6)
= 1 Hz. The results of these tests and comparisons with conventional systems
are described in reference [1]. The system compares favorably with the more
conventional FSK (frequency-shift keying) systems using space-diversity,
particularly at low error rates. This is what would be expected since the
wide-band signal used in the rake system, and its optimum reception, constitutes
a form of optimum frequency diversity. Its most important practical
limitation (besides equipment complexity) is its use of a wider bandwidth than is
usually available. However, situations can be imagined in which the
extremely low error rates achievable with the wide-band rake technique would be
worth the expenditure of bandwidth.

Perhaps the most serious remaining problem in this work is the one that
I have alluded to repeatedly: the study of more suitable waveforms. The
size of the second denominator of equation (12) has been observed to be a 
limiting factor in system performance. Until this problem is solved the full 
potentialities of the system will not be realized. 
1. — Introduction.


An information system is often represented schematically as a cascade of
an information source, an encoder, a channel perturbed by noise, a decoder
and an information sink (Fig. 1). The noise is usually considered to be
introduced in the channel, whereas the processes of encoding and decoding are
taken to be free of noise. Such an assumption is often equivalent to the
supposition that the actual physical transmission represented by the channel
is followed by an amplification with a high gain so that any additional noise
introduced in the physical decoder is negligible compared to the signal level in the
decoder. Since such amplifiers introduce noise of their own, which may be
comparable to the signal level, they must be considered to be part of the
channel in the schematic of the information system of Fig. 1.

Fig. 1. — Schematic of information system.

We shall be concerned with the optimization of the two terminal-pair
amplifiers that have to be employed in order to raise the power level of the
signal before its entry into the decoder. In optimizing the operation of these
amplifiers we shall make use of the measure of noise performance commonly
employed in engineering practice, the noise figure. The noise figure is defined
as the quotient of the signal-to-noise ratio at the input of the amplifier to
that at the output of the amplifier. Both signal power and noise power are
assumed to be contained within a band of frequencies so narrow that the
amplifier characteristics may be considered to be constant over the band.
In other words, the signal-to-noise ratios are obtained as ratios of spectral
densities. The input noise is assumed to be thermal noise corresponding to
the standard noise temperature, T_0 = 290 °K. The use of the noise figure to
characterize amplifier noise performance restricts the problem to a study of
single-frequency noise performance.
An alternate way of expressing the noise figure F is

(1)    $F = \dfrac{N}{N_0}$ ,

where N is the available output power of the amplifier within a narrow
frequency interval Δf, and N_0 is the available output power that would exist
if the amplifier were noise free.
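
The two ways of expressing the noise figure can be compared numerically. The following is a minimal sketch in Python, not part of the original lecture: the gain, the internal noise and the signal level are hypothetical values chosen only for illustration, and the function names are our own.

```python
import math

K_BOLTZMANN = 1.380649e-23  # Boltzmann constant, J/K
T0 = 290.0                  # standard noise temperature, K


def noise_figure_from_snr(snr_in: float, snr_out: float) -> float:
    """Noise figure as the quotient of input to output signal-to-noise ratio."""
    return snr_in / snr_out


def noise_figure_from_powers(n_total: float, n_noise_free: float) -> float:
    """Noise figure as N / N_0 (actual over noise-free available output noise)."""
    return n_total / n_noise_free


# Hypothetical amplifier, chosen only for illustration.
delta_f = 1.0                                # narrow band, Hz
gain = 100.0                                 # available power gain
source_noise = K_BOLTZMANN * T0 * delta_f    # thermal noise available from source
internal_noise = 0.5 * gain * source_noise   # internal noise seen at the output
signal_in = 1.0e-15                          # available input signal power, W

n_noise_free = gain * source_noise           # N_0: output noise of a noise-free amplifier
n_total = n_noise_free + internal_noise      # N: actual available output noise

f_snr = noise_figure_from_snr(signal_in / source_noise,
                              gain * signal_in / n_total)
f_pow = noise_figure_from_powers(n_total, n_noise_free)
print(f_snr, f_pow)
```

With these assumed numbers both expressions give a noise figure of about 1.5, since the internal noise was taken to be half of the amplified source noise.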

In the noise figure definition it is implied that the internal noise of the 
amplifier is additive to the signal passing through the amplifier. We shall 
assume that this condition of linearity is satisfied throughout the analysis. 

We are here concerned with the study of the basic limits of two terminal- 
pair amplifier noise performance as measured by the noise figure at high gain. 
The restriction to high gain is a necessary one, since otherwise the problem is 
not defined. Indeed, the noise figure of any amplifier can be reduced to unity 
at a complete sacrifice of gain by short-circuiting the input terminals to 
the output terminals. 

Recognizing that amplifiers basically provide «gain building blocks», which
should add as little as possible to the system noise, we shall use the
following criterion for evaluating the quality of amplifier noise
performance:

Suppose that n different types of amplifiers are compared. An unlimited 
number of amplifiers of each type is presumed to be available. A general 
lossless (possibly non-reciprocal) interconnection of an arbitrary number of 
amplifiers of each type is then envisioned, with terminals so arranged that in 
each case an over-all two terminal-pair network is achieved. For each amplifier 
type, both the lossless interconnecting network and the number of amplifiers 
are varied in all possible ways to produce two conditions simultaneously: 


a) a very high available gain (approaching infinity) for the over-all two 
terminal pair system when driven from a source having a positive 
internal impedance; and 


b) an absolute minimum noise figure F_min for the resulting high-gain
system.

The value of (F_min − 1) for the resulting high-gain two terminal-pair network
is taken specifically as a «measure of quality» of the amplifier type in each
case. The «best» amplifier type will be the one yielding the smallest value of
(F_min − 1) at very high gain.

The network interconnection envisaged with each amplifier type is shown
in Fig. 2. Lossless interconnections are used since only such interconnections
do not add to system noise and therefore such interconnections should be able
to achieve optimum noise performance. (A proof has been presented [1] to the
effect that interconnections with loss cannot lead to a noise performance
better than that achievable with lossless interconnections).

Fig. 2. — Lossless interconnection of amplifiers.

Most of the detailed proofs mentioned here are published elsewhere [1-3]. Here
we shall concentrate on three major ideas, which help towards an understanding
of the limits on amplifier noise performance.

1) Every two terminal-pair linear noisy network possesses at most two
invariants with regard to lossless network transformations performed on the
network which leave the number of terminal pairs of the network unchanged.
The most general such transformation is shown in Fig. 3. These two invariants
are characteristic of the internal noise of the network and of the ability of
the network to deliver or absorb power. They represent a convenient summary of
all the properties of the network that remain unaffected by lossless
transformations.

Fig. 3. — Lossless transformation or «imbedding» of two terminal-pair amplifier.

2) The two basic invariants are related to the optimum noise performance
achievable with a two terminal-pair amplifier. This relation establishes two
facts: a) every amplifier possesses a basic limit of its noise performance and
b) the limit of this noise performance has invariant characteristics (as one
would expect from a basic quantity).

3) In some physical cases studied so far there is a connection between
the network-theoretical limitation of the noise performance of an amplifier
on one hand, and the physical gain mechanisms and noise processes on the
other hand. The newly developed maser amplifier will serve as an illustration.


2. — Canonical form of linear noisy two terminal-pair network. 


A linear two terminal-pair network with internal sources is conveniently
characterized in terms of the impedance matrix relation (*)

(2)    $V = Z I + E$ .

Here V is a column matrix of 2nd order comprising the two terminal voltages
of the network, I comprises the two currents, Z is a square matrix of 2nd order,
and E comprises the two open-circuit noise voltages of the network. In the
study of noise the complex amplitudes of the voltages are meaningless by
themselves and only self- and cross-spectral densities have physical
significance. They are conveniently summarized in a square matrix of 2nd order
which is

(3)    $E E^{\dagger} = \begin{pmatrix} E_1 E_1^{*} & E_1 E_2^{*} \\ E_2 E_1^{*} & E_2 E_2^{*} \end{pmatrix}$ .

Here the dagger † indicates the operation of taking the complex conjugate
transpose of a matrix. The complex amplitudes E are supposed to be RMS.
A linear noisy two terminal-pair network is completely characterized at any
particular frequency by the two matrices Z and EE†.

If a lossless network transformation is performed on the network such as
shown in Fig. 3, henceforth called an «imbedding», a new network is obtained
with a new impedance matrix and a new correlation matrix EE†. This shows
that the eight parameters (six of which are complex) characterizing a
particular two terminal-pair network may be affected by a lossless
transformation. One may suspect that at least some features of these eight
parameters ought to be preserved in such a transformation. In other words, one
would expect that every network possesses a certain set of invariants with
regard to lossless transformations.

In references [1] and [2], R. B. ADLER and the author indeed found that
every two terminal-pair network possesses all in all two invariants (one of
which may assume the trivial value of zero). These two invariants are the
eigenvalues of the 2nd order characteristic noise matrix defined by

(4)    $N = -\tfrac{1}{2}\,(Z + Z^{\dagger})^{-1}\, E E^{\dagger}$ .

(*) We concentrate on two terminal-pair networks since these are used as
amplifiers. The proofs have actually been carried out for n terminal-pair
networks.


The characteristic noise matrix (4) contains two significant features of a
linear noisy network. First, there is the positive definite correlation matrix
EE† which describes the noise within the network. Secondly, it contains the
inverse of the matrix (Z + Z†) which characterizes the ability of the network
to generate or absorb power. Indeed, in the absence of noise, the power P
entering the network is given by

(5)    $P = \tfrac{1}{2}\, I^{\dagger} (Z + Z^{\dagger})\, I$ .

In the classification of networks three cases have to be distinguished:

a) The network is passive; Z + Z† is positive definite.

b) The network is incapable of power absorption and generates power
under any arbitrary adjustment of the terminal currents. The matrix Z + Z†
is negative definite. This is the case of a negative resistance network.

c) The matrix Z + Z† is indefinite. The network can either generate or
absorb power depending upon the adjustment of the terminal currents.

The signs of the eigenvalues of N in Eq. (4) can now be determined from
the fact that the signature of N is controlled by the signature of Z + Z†,
since the correlation matrix is positive definite. For the three cases
distinguished above, we have

a) both eigenvalues are negative,
b) both eigenvalues are positive,
c) the two eigenvalues are of opposite sign.
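
The sign rule for the three cases can be checked numerically. The sketch below is ours and uses plain Python; it assumes the characteristic noise matrix has the form N = −½ (Z + Z†)⁻¹ EE†, and the impedance and correlation matrices are hypothetical examples, not values from the lecture.

```python
import cmath


def dagger(a):
    """Complex conjugate transpose of a 2x2 matrix given as nested lists."""
    return [[a[j][i].conjugate() for j in range(2)] for i in range(2)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def inverse(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def eigenvalues(a):
    """Eigenvalues of a 2x2 matrix; they are real for the matrix N used here."""
    trace = a[0][0] + a[1][1]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    root = cmath.sqrt(trace * trace - 4 * det)
    return sorted([((trace - root) / 2).real, ((trace + root) / 2).real])


def characteristic_noise_matrix(z, ee):
    """Assumed form: N = -1/2 (Z + Z^dagger)^-1 E E^dagger."""
    zd = dagger(z)
    h = [[z[i][j] + zd[i][j] for j in range(2)] for i in range(2)]
    return [[-0.5 * x for x in row] for row in matmul(inverse(h), ee)]


def classify(z, ee):
    """Return 'a', 'b' or 'c' according to the signs of the eigenvalues of N."""
    low, high = eigenvalues(characteristic_noise_matrix(z, ee))
    if high < 0:
        return "a"   # both negative: passive network
    if low > 0:
        return "b"   # both positive: negative resistance network
    return "c"       # opposite signs: Z + Z^dagger indefinite


# Hypothetical data: a positive definite correlation matrix and three networks.
ee = [[4, 1], [1, 3]]
z_passive = [[2 + 1j, 0.5], [0.5, 1 - 2j]]                    # Z + Z^dagger > 0
z_negative = [[-x for x in row] for row in z_passive]         # Z + Z^dagger < 0
z_indefinite = [[1, 0], [0, -1]]                              # indefinite

for name, z in (("passive", z_passive), ("negative", z_negative),
                ("indefinite", z_indefinite)):
    print(name, classify(z, ee), eigenvalues(characteristic_noise_matrix(z, ee)))
```

The eigenvalues come out real because N is the product of the inverse of a Hermitian matrix and a positive definite matrix, which is similar to a Hermitian matrix.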


The fact that every two terminal-pair network has 2 invariants with respect
to lossless transformations suggests that there should exist at least one
lossless transformation for every particular two terminal-pair network that
reduces the network into a form which places the two invariants into direct
evidence. This «canonical» form should not contain more than two parameters
representing the two invariants of the network. A proof to that effect
has been carried out [2] and the resulting canonical form of the network is
represented in Fig. 4.

[Fig. 4: canonical form of the network. The signs of the resistances are for
case (a) + +; (b) − −; (c) + −.]

The proof is summarized in the following theorem: At any particular
frequency, every two terminal-pair network can be reduced by lossless
imbedding into a canonical form consisting of two separate (possibly negative)
resistances in series with uncorrelated noise voltage generators E_j.

The two values of the |E_j|² are related to the two eigenvalues λ_j of the
characteristic noise matrix N by the formula

(6)    $|E_j|^2 = \mp\, 4 \lambda_j , \qquad (j = 1, 2)$ ,

where the − sign applies to noise voltages pertaining to a positive resistance,
the + sign to those pertaining to a negative resistance. The signs of the two
resistances that appear in the canonical form of a network are uniquely
determined by the impedance matrix of the original network and a) are both
positive for a passive network, b) are both negative for a negative resistance
network, c) one is positive, and the other is negative, for a network with an
indefinite Z + Z† matrix.

Equation (6) can be checked easily by evaluation of the characteristic noise
matrix for the canonical network. In this case N is diagonal.

In connection with the above theorem it is worth noting that the eigenvalues
of N for a passive network at thermal equilibrium with the equilibrium
temperature T are all identical and have the value −kT Δf. This statement
is rather obvious in connection with Eq. (6) and the fact that each positive
resistance of the canonical form must have an available power of kT Δf
according to the Nyquist formula.
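
Both the thermal-equilibrium value and the invariance of the eigenvalues can be illustrated with a short calculation. The following Python sketch is ours and rests on stated assumptions: the characteristic noise matrix is taken as N = −½ (Z + Z†)⁻¹ EE†, the equilibrium correlation matrix as EE† = 2 kT Δf (Z + Z†) (the matrix form of the Nyquist formula for RMS amplitudes), and the lossless imbedding is a hypothetical network of ideal transformers, V' = TV with a real nonsingular T, which gives Z' = T Z T† and E'E'† = T EE† T†. All numerical values are illustrative.

```python
import cmath


def dagger(a):
    return [[a[j][i].conjugate() for j in range(2)] for i in range(2)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def inverse(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def eigenvalues(a):
    trace = a[0][0] + a[1][1]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    root = cmath.sqrt(trace * trace - 4 * det)
    return sorted([((trace - root) / 2).real, ((trace + root) / 2).real])


def hermitian_part_doubled(z):
    """Z + Z^dagger."""
    zd = dagger(z)
    return [[z[i][j] + zd[i][j] for j in range(2)] for i in range(2)]


def noise_matrix(z, ee):
    """Assumed form: N = -1/2 (Z + Z^dagger)^-1 E E^dagger."""
    prod = matmul(inverse(hermitian_part_doubled(z)), ee)
    return [[-0.5 * x for x in row] for row in prod]


def transformer_imbedding(t, z, ee):
    """Hypothetical lossless imbedding by ideal transformers: V' = T V."""
    td = dagger(t)
    return matmul(matmul(t, z), td), matmul(matmul(t, ee), td)


KT_DF = 1.0  # kT * delta_f in normalized units

# 1) Passive network at thermal equilibrium: eigenvalues should equal -kT*delta_f.
z_thermal = [[2 + 1j, 0.5j], [0.5j, 1 - 2j]]
ee_thermal = [[2 * KT_DF * x for x in row]
              for row in hermitian_part_doubled(z_thermal)]
thermal_eigs = eigenvalues(noise_matrix(z_thermal, ee_thermal))

# 2) A noisy network with indefinite Z + Z^dagger, before and after imbedding.
z_amp = [[1, 0], [3, -1]]
ee_amp = [[4, 1], [1, 3]]
t = [[1, 2], [0.5, 3]]
z_new, ee_new = transformer_imbedding(t, z_amp, ee_amp)
eigs_before = eigenvalues(noise_matrix(z_amp, ee_amp))
eigs_after = eigenvalues(noise_matrix(z_new, ee_new))

print(thermal_eigs, eigs_before, eigs_after)
```

Under these assumptions the transformed matrix is N' = (T†)⁻¹ N T†, a similarity transformation of N, which is why the two eigenvalue lists agree.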

A canonical form is particularly convenient for summarizing the essential
unalterable characteristics of a linear noisy network. The canonical form of
the amplifier also helps in recognizing the network-theoretical limitations
on the noise performance of a linear amplifier.


3. — Amplification as a coupling to a negative resistance. 


The fact that every two terminal-pair network possesses a canonical form
as shown in Fig. 4 may now be used to obtain an understanding of the reasons
for the existence of a basic limit on amplifier noise performance. Thus,
consider the problem of noise figure optimization as originally stated and
illustrated in Fig. 2. The imbedding network is assumed to be the general
network that leads to optimum noise performance. Every amplifier can be
represented by its canonical form imbedded in the lossless network which is
inverse to that needed to reduce the original amplifier into canonical form.
The lossless imbedding networks may be considered to be part of the general
imbedding network of Fig. 2. One then obtains for Fig. 2 the new Fig. 5. For
definiteness we have assumed that our amplifiers are all identical, with
Z + Z† indefinite. We shall adhere to this assumption throughout and only state
at the end of the discussion how our arguments have to be modified in order to
take into account the class of amplifiers with Z + Z† negative definite.

Fig. 5. — Alternative representation of lossless interconnection of amplifiers.

Next, we shall demonstrate that the most common conventional amplifier, with
the equivalent circuit shown in Fig. 6, may be considered to be composed of a
negative resistance, a positive resistance, and a lossless nonreciprocal
device, a circulator. The negative resistance provides the gain, the positive
resistance accounts for the dissipative behavior of the network of Fig. 6 when
excited from terminal-pair (3). The circulator is a lossless non-reciprocal
network that transmits waves incident on any one of its four terminal-pairs as
shown in Fig. 7. The transmission lines connected to the four terminal-pairs
are assumed to have one ohm characteristic impedance. If a wave is incident on
transmission line (1) of the circulator and if all other three terminal-pairs
are matched with resistances of one ohm, the wave is transmitted without loss
to terminals (2). On the other hand, if a wave is incident from terminals (2)
with the remaining three terminal-pairs matched, the wave is transmitted
without loss to terminals (3), and so forth.

Fig. 6. — Unilateral amplifier.

We connect a negative resistance of −1 Ω through an ideal transformer
of turns-ratio m:1 to terminals (2) of Fig. 7. The resistance seen on the
secondary of the transformer is then

$R = -m^2$ .

On terminals (4) we connect a positive resistance of +1 Ω. Terminals (1)
are used as the input of the amplifier, terminals (3) as the output. We shall



NETWORK THEORETICAL AND PHYSICAL LIMITATIONS ETC. 423 


now inspect the nature of the equivalent circuit of the resulting amplifier dis- 
regarding the internal noise for the time being. 


Fig. 7. — Circulator with positive and negative resistance connected.


Suppose a wave is incident from the input on terminals (1). This wave is
transmitted to the negative resistance connected to terminals (2) and reflected
with a corresponding increase of power:

(7)    $a_2 = \Gamma\, b_2 = \Gamma\, a_1$ ,

where

(8)    $\Gamma = \dfrac{R - 1}{R + 1}$ .

Γ is the reflection coefficient of the resistance R, a quantity of magnitude
greater than unity since R is negative. The wave a_2 appears without loss at
terminals (3). We thus have

(9)    $b_3 = \Gamma\, a_1$   for   $a_3 = 0$ ,

and since no reflected wave appears at terminals (1) we have

(10)    $b_1 = 0$   for   $a_3 = 0$ .

Now applying another boundary condition to the device by setting a_3 ≠ 0
and a_1 = 0, we find that the wave incident upon terminals (3) is transmitted
directly into terminals (4). We thus have

(11)    $b_1 = b_3 = 0$   for   $a_1 = 0$ .

Combining Eqs. (9) to (11) into a single scattering matrix relationship we find

$\begin{pmatrix} b_1 \\ b_3 \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ \Gamma & 0 \end{pmatrix} \begin{pmatrix} a_1 \\ a_3 \end{pmatrix}$

and thus see that the network has the scattering matrix

(12)    $S = \begin{pmatrix} 0 & 0 \\ \Gamma & 0 \end{pmatrix}$ .

The power gain of the amplifier is

(13)    $G = \dfrac{|b_3|^2}{|a_1|^2} = |\Gamma|^2$ .


The equivalent circuit of the network with the scattering matrix (12) is that 
shown in Fig. 6 and is identical with the equivalent circuit of a conventional 


unilateral amplifier as represented by an ideal triode with finite grid resistance. 
For μ of Fig. 6 we have


R—1 
ee Da 
(14) pre. 


The construction employed here demonstrates the nature of amplification 
in a conventional unilateral amplifier with the equivalent circuit of Fig. 6. 
Amplification is obtained by coupling the input excitation into a negative 
resistance. The positive resistance connected to the ideal circulator only serves 
to isolate the input and the output of the amplifier by absorbing any wave 
incident into the output terminals of the amplifier. 
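
The dependence of the gain on the transformer adjustment can be made concrete with a few numbers. The sketch below is ours; it assumes that the negative resistance seen through the m:1 transformer is R = −m², that its reflection coefficient on a one-ohm line is Γ = (R − 1)/(R + 1), and that the power gain from terminals (1) to terminals (3) is |Γ|². The chosen turns ratios are illustrative only.

```python
def reflection_coefficient(r: float) -> float:
    """Reflection coefficient of a resistance r on a one-ohm line (assumed form)."""
    return (r - 1.0) / (r + 1.0)


def circulator_gain(m: float) -> float:
    """Power gain of the circulator amplifier for a transformer of turns ratio m:1."""
    r = -m * m                      # negative resistance seen at terminals (2)
    gamma = reflection_coefficient(r)
    return gamma * gamma            # G = |Gamma|^2


def scattering_matrix(m: float):
    """Scattering matrix between terminals (1) and (3): only S31 is nonzero."""
    gamma = reflection_coefficient(-m * m)
    return [[0.0, 0.0], [gamma, 0.0]]


gains = {m: circulator_gain(m) for m in (0.5, 0.9, 0.99)}
for m, g in gains.items():
    print(f"m = {m}: G = {g:.1f}")
```

As m approaches unity the seen resistance approaches −1 Ω, the denominator of Γ tends to zero, and the gain grows without limit, while the zero entries of the scattering matrix express the isolation between output and input.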

This picture of the gain mechanism is very useful in gaining an understanding
of the basic limit on noise performance as proved in references [1-3].
Indeed, returning to the equivalent circuit of Fig. 5 we note that
obtaining gain from the resulting two terminal-pair network depends upon
our ability to couple to at least one of the negative resistances on the right
hand side of the circuit. In coupling to this resistance one has to couple to
the internal noise of the resistance as well. Since all negative resistances
and their noise sources are identical, one may couple to any number as
well as to a single one of the entire set. If it is desired to obtain an
amplifier with the equivalent circuit of Fig. 6 it is also necessary to couple
to at least one of the positive resistances in order to obtain absorption of
the power reflected back into the output of the amplifier.

This reasoning leads one to suspect that the network of Fig. 7 is one of 
the physical forms of the imbedding network of Fig. 5 which realizes the 
optimum noise performance for the amplifier interconnection. In this form, 
terminals marked « input » and « output » in Fig. 5 are taken to correspond 
to terminals (1) and (3) of the circulator. Terminals (2) and (4) of the
circulator are any other two of the terminal-pairs of the «imbedding network»,
one connected to a positive resistance, the other to a negative resistance. That
this particular network form indeed realizes the optimum noise performance 
will be confirmed by direct evaluation. For this purpose, it is helpful to intro- 
duce a new measure of noise performance which preserves its significance even 
at low amplifier gain whereas the noise figure serves as a noise performance 
criterion only at high gains. 


4. — A quantitative measure of amplifier noise performance for amplifiers of
low gain.


We have accepted as the measure of quality of noise performance the minimum
noise figure at high gain that can be achieved by a lossless interconnection
of amplifiers of a given type. In the detailed study reported elsewhere it was
found helpful to introduce an auxiliary measure of noise performance.

The measure of noise performance is

(14a)    $M = \dfrac{F - 1}{1 - (1/G)}$ .

Here, G is the available gain of the two terminal-pair network (for a network
with loss G is less than unity). Certain modifications in the above definition
are necessary when the source impedance used, or the output impedance of
the amplifier, have negative real parts. Here we shall not be concerned with
such cases.

It is clear from the definition of M that for high gain (G → ∞) it reduces
to the excess noise figure. In this limiting case, M serves directly as the
measure of noise performance previously accepted. At small gains it can be
shown that M has a significance of its own so that it can also be accepted as
an appropriate measure of noise performance. M has the following interesting
properties:

1) In a cascade of two amplifiers with different noise measures the
amplifier with the least noise measure (and not necessarily noise figure)
should be used as the first stage in order to obtain the least overall noise
figure.

2) A cascade of amplifiers with identical noise measures (but not
necessarily the same gain and noise figures) leads to an amplifier with the
same noise measure.

3) The smallest value of the noise measure achievable by lossless or
passive transformations performed on an amplifier is given by

$M_{\mathrm{opt}} = \dfrac{\lambda_1}{k T_0\, \Delta f}$ ,

where λ_1 is the least positive eigenvalue of the characteristic noise matrix (4).

Thus M has all the properties one would require from a fundamental
measure of amplifier noise performance.
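
The first two properties can be verified with the usual cascade formula for noise figures, F = F_1 + (F_2 − 1)/G_1, together with the definition M = (F − 1)/(1 − 1/G). The Python sketch below is ours; the gains and noise figures are hypothetical and the function names are chosen only for this illustration.

```python
def noise_measure(f: float, g: float) -> float:
    """Noise measure M = (F - 1) / (1 - 1/G)."""
    return (f - 1.0) / (1.0 - 1.0 / g)


def noise_figure_from_measure(m: float, g: float) -> float:
    """Noise figure of an amplifier with noise measure m and available gain g."""
    return 1.0 + m * (1.0 - 1.0 / g)


def cascade(f1: float, g1: float, f2: float, g2: float):
    """Noise figure and gain of stage 1 followed by stage 2 (cascade formula)."""
    return f1 + (f2 - 1.0) / g1, g1 * g2


# Property 2: two stages with the same noise measure but different gains.
m_common = 0.4
g1, g2 = 4.0, 10.0
f1 = noise_figure_from_measure(m_common, g1)
f2 = noise_figure_from_measure(m_common, g2)
f12, g12 = cascade(f1, g1, f2, g2)
m_cascade = noise_measure(f12, g12)

# Property 1: amplifier B has the larger noise figure but the smaller noise measure.
fa, ga = 2.0, 2.0
fb, gb = 2.2, 100.0
m_a, m_b = noise_measure(fa, ga), noise_measure(fb, gb)
f_a_first, _ = cascade(fa, ga, fb, gb)
f_b_first, _ = cascade(fb, gb, fa, ga)

print(m_cascade, m_a, m_b, f_a_first, f_b_first)
```

In this example amplifier B, with the smaller noise measure, gives the lower overall noise figure when placed first, although its own noise figure is the larger of the two.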


5. — Realization of the optimum noise performance. 


We shall now illustrate the realization of the lower limit on noise
performance with the aid of the circulator scheme discussed in Sect. 3. We
connect the negative resistance of the canonical form of the amplifier belonging
to class c) to terminals (2) of the circulator through an ideal transformer of
turns ratio m:1, the positive resistance to terminals (4). We have for the
noise figure

(16)    $F - 1 = \dfrac{|b_3|^2}{|\Gamma\, a_1|^2}$ ,

where one takes |a_1|² = kT Δf.
In b_3 of Eq. (16) appears the noise internal to the amplifier. We have

(17)    $|b_3|^2 = \dfrac{|E'|^2}{(1 + R)^2}$ ,

where E' is the voltage seen from the secondary of the ideal transformer,
E' = mE_1.

The noise measure (14a) thus becomes

$M = \dfrac{|E'|^2}{|\Gamma|^2 (1 + R)^2\, kT\,\Delta f\,\left[\,1 - \left((1 + R)/(1 - R)\right)^2\right]} = -\dfrac{|E'|^2}{4 R\, kT\,\Delta f} = \dfrac{|E_1|^2}{4\, kT\,\Delta f} = \dfrac{\lambda_1}{kT\,\Delta f}$ ,

where λ_1 is the positive eigenvalue of the characteristic noise matrix of the
amplifier. Thus, the scheme of Fig. 7 realizes the lowest possible value of M
at every adjustment of the ideal transformer, in particular for m → 1, i.e. for
the limit of G → ∞ (see Eq. (14)).
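
That the noise measure of the circulator scheme does not depend on the transformer adjustment can be confirmed numerically. The sketch below is ours and assumes the following relations: R = −m² for the resistance seen at terminals (2), Γ = (R − 1)/(R + 1), G = |Γ|², E' = mE_1 for the noise voltage seen through the transformer, a noise wave of power |E'|²/(1 + R)² leaving terminals (3), and an input noise power kT Δf. The value of |E_1|² is hypothetical and quantities are in normalized units.

```python
def noise_measure_circulator(m: float, e1_sq: float, kt_df: float) -> float:
    """Noise measure of the circulator amplifier for turns ratio m:1 (0 < m < 1)."""
    r = -m * m                              # resistance seen on the secondary
    gamma = (r - 1.0) / (r + 1.0)           # reflection coefficient on a one-ohm line
    gain = gamma * gamma                    # power gain G
    e_sec_sq = m * m * e1_sq                # |E'|^2 with E' = m E_1
    b3_sq = e_sec_sq / (1.0 + r) ** 2       # noise power leaving terminals (3)
    excess = b3_sq / (gain * kt_df)         # F - 1
    return excess / (1.0 - 1.0 / gain)      # M = (F - 1) / (1 - 1/G)


KT_DF = 1.0      # kT * delta_f, normalized
E1_SQ = 0.8      # hypothetical |E_1|^2 of the negative resistance

measures = [noise_measure_circulator(m, E1_SQ, KT_DF)
            for m in (0.2, 0.5, 0.9, 0.99)]
print(measures, E1_SQ / (4.0 * KT_DF))
```

Under these assumptions every adjustment gives the same value |E_1|²/(4 kT Δf), which is the algebraic content of the reduction carried out in the text.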

In the optimization of noise performance carried out so far it was assumed 
that the amplifiers belonged to class (c), so that the canonical form contained 
both a positive and a negative resistance. The negative resistance was con- 
nected to terminal-pair (2) of the circulator and provided the gain, and the 
positive resistance was connected to terminal-pair (4) in order to provide the 
absorption of a wave incident into the output terminal pair of the amplifier. 
We shall now discuss the modifications necessary in the optimization scheme 
when using amplifiers of class (b) with two negative resistances in their cano- 
nical form. Clearly, when optimizing the noise performance, power gain should 
be obtained by coupling to that of the two resistances which has the smaller 
open-circuit noise voltage. In the scheme employing a circulator in connection 
with one positive and one negative resistance, one uses this negative resistance 
on terminal-pair (2). On terminal-pair (4) one may connect any positive re- 
sistance. The noise of this resistance does not affect the noise figure since 
the noise is all absorbed in the source. The optimum noise measure is again
given by

$M_{\mathrm{opt}} = \dfrac{|E|^2}{4\, kT\, \Delta f}$ ,

where |E|² is the open circuit noise voltage of the negative resistance used
in the amplifier. This particular noise-measure optimization scheme employs
(essentially) a lossy network and is, therefore, an example of a noise
performance optimization with a lossy network.

It is interesting to note that the noise measure of a passive network, at the
equilibrium temperature T_0 of its source, is always −1. Indeed, the power
available at the network output terminals must satisfy the Nyquist formula,
but is also directly related to the noise figure by definition. One has

$N = F G\, k T_0\, \Delta f = k T_0\, \Delta f$

or

$F G = 1$ ,

thus

(19)    $M = -1$ .
For a network at temperature T one has instead

(20)    $M = -\dfrac{T}{T_0}$ .


6. — Physical limitations of amplifier noise performance.


We shall now turn to a brief study of how the lower limits to amplifier noise
performance are established physically. The limits are understood in several
cases, such as the general microwave electron beam amplifier [4], of which
the conventional triode is a special case, the new parametric amplifiers and
the maser amplifier. Although the triode is the most common amplifier, we
shall not study its noise performance here since its study is rather involved.
Instead, we shall study a somewhat over-simplified version of a two-level
maser in which the limiting noise performance and thermodynamics are
particularly intimately connected.

In a paramagnetic salt with two quantum mechanical energy levels, the
ratio of the populations of states in the upper and the lower levels at
equilibrium temperature T is given by the well known Boltzmann formula

(21)    $\dfrac{n_u}{n_l} = \exp\!\left[-\dfrac{h\nu}{kT}\right]$ ,

where hν is the separation between the energy levels measured in terms of
frequency ν, and T is the temperature of the salt. This situation is illustrated
schematically in Fig. 8 where the exponential factor is sketched.

A salt with the population distribution of Fig. 8 is passive and has net
absorption of radiation incident upon it. Suppose now that the populations in
the two energy levels are reversed. This can be accomplished in principle
in a paramagnetic salt by a fast reversal of the time average magnetic field,
if the energy separation is originally caused by such a field. The spin systems
then do not change their orientation fast enough but remain, for some time, in
their original spatial orientation. The distribution of states is now

(22)    $\dfrac{n_u}{n_l} = \exp\!\left[+\dfrac{h\nu}{kT}\right]$ .

Fig. 8. — Populations in the two energy levels.

The material is now emissive and «presents a negative resistance» to
radiation. It is noteworthy that the original temperature of the sample appears
in the ratio (22) with a reversal of sign in the exponent.
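
The population ratio and its inversion are easily evaluated. The Python sketch below is ours; it assumes the Boltzmann ratio n_u/n_l = exp(−hν/kT) for the equilibrium state, and the frequency and temperature are hypothetical values picked only for illustration. The function `effective_temperature` simply solves the Boltzmann formula for T given a ratio.

```python
import math

H_PLANCK = 6.62607015e-34    # Planck constant, J s
K_BOLTZMANN = 1.380649e-23   # Boltzmann constant, J/K


def population_ratio(freq_hz: float, temp_k: float) -> float:
    """Upper-to-lower population ratio exp(-h nu / k T)."""
    return math.exp(-H_PLANCK * freq_hz / (K_BOLTZMANN * temp_k))


def effective_temperature(ratio: float, freq_hz: float) -> float:
    """Temperature that reproduces a given ratio when put into the Boltzmann factor."""
    return -H_PLANCK * freq_hz / (K_BOLTZMANN * math.log(ratio))


FREQ = 9.0e9     # hypothetical transition frequency, Hz
TEMP = 4.2       # hypothetical equilibrium temperature of the salt, K

ratio_equilibrium = population_ratio(FREQ, TEMP)   # less than one: absorptive
ratio_inverted = 1.0 / ratio_equilibrium           # populations interchanged

print(ratio_equilibrium, effective_temperature(ratio_equilibrium, FREQ))
print(ratio_inverted, effective_temperature(ratio_inverted, FREQ))
```

Interchanging the two populations inverts the ratio, and the temperature recovered from the Boltzmann factor then has the same magnitude as the original temperature but the opposite sign.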

If the form of the Boltzmann factor is retained even for the active state
of the salt, one may characterize the material by a negative temperature.

While such a practice is debatable, it is still interesting that this same
negative temperature plays an important role in determining the basic limit
on the noise performance of the sample. One can show quite generally that
the best noise measure achievable with the two-level maser just described is
given by
Moi = = =. 


0 


where 7, is the negative temperature appearing in the Boltzman factor cha- 
racterizing the distribution of states of the excited two-level maser. This 
expression should be compared with the expression for the noise measure of 
a passive network, Eq. (20). 

In summary, we may state that the answer for the basic limit on the noise
performance of a two-level maser was particularly simple since we had to deal
with a state of inverted thermal equilibrium which still retains many
characteristics of thermal equilibrium. For thermal equilibrium, however, the
noise measure defined here acquires a particularly simple value.


1. — Introduction.


The three lectures I shall give in this Course on Information Theory con- 
cern the theory of filtering and prediction that was originated by N. WIENER 
and published by him in the book Extrapolation, Interpolation and Smoothing 
of Stationary Time Series (New York, 1950). 

In this theory the messages and noise are assumed to be continuous
stationary random processes for which autocorrelation functions exist. If
f_a(t) is a message or a noise, where t represents time, then its
autocorrelation function is defined as

(1)    $\varphi_{aa}(\tau) = \lim_{T\to\infty} \dfrac{1}{2T} \int_{-T}^{T} f_a(t)\, f_a(t + \tau)\, dt$ ,

in which the time displacement τ has the range (−∞, ∞). For an example of this
function consider the rectangular wave f_a(t) shown in Fig. 1. This wave has
two possible values of amplitude, namely, +E and −E.

Fig. 1.

We assume that the zero-crossings follow the Poisson distribution

(2)    $P(n, \tau) = \dfrac{(k\tau)^n}{n!} \exp[-k\tau]$ ,

which gives the probability of finding n zero-crossings in the duration τ in
terms of the average number of zero crossings per second k. The computation of
an autocorrelation function is a fairly long story which we shall not be able
to tell here. We merely state that the autocorrelation function of the
Poisson rectangular wave can be shown to have the expression

(3)    $\varphi_{aa}(\tau) = E^2 \exp[-2k|\tau|]$ .
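
Although the analytical computation is not given here, the stated result can be checked by simulation. The following Python sketch is ours: it approximates the Poisson rectangular wave on a discrete time grid (a sign change occurs in each step with probability k·dt, an assumption that is good only for k·dt much less than one), estimates the autocorrelation by a finite time average, and compares it with E² exp(−2k|τ|). The step size, rate, record length and seed are arbitrary choices.

```python
import math
import random


def telegraph_wave(n_steps: int, dt: float, rate: float, amplitude: float,
                   seed: int = 0) -> list:
    """Samples of a wave switching between +amplitude and -amplitude.

    A zero-crossing occurs in each step with probability rate*dt, which
    approximates Poisson-distributed crossings when rate*dt << 1.
    """
    rng = random.Random(seed)
    value = amplitude
    samples = []
    for _ in range(n_steps):
        if rng.random() < rate * dt:
            value = -value
        samples.append(value)
    return samples


def autocorrelation(samples: list, lag: int) -> float:
    """Finite time-average estimate of the autocorrelation at a lag in steps."""
    n = len(samples) - lag
    return sum(samples[i] * samples[i + lag] for i in range(n)) / n


DT = 0.01          # time step, s
RATE = 1.0         # average number of zero-crossings per second, k
AMPLITUDE = 1.0    # E
samples = telegraph_wave(400_000, DT, RATE, AMPLITUDE, seed=0)

for lag in (0, 25, 50):
    tau = lag * DT
    estimate = autocorrelation(samples, lag)
    theory = AMPLITUDE ** 2 * math.exp(-2.0 * RATE * abs(tau))
    print(f"tau = {tau:.2f}: estimate {estimate:.3f}, theory {theory:.3f}")
```

The estimate fluctuates from one record to another, so agreement with the exponential is only statistical; a longer record or a smaller step narrows the difference.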


In a manner similar to (1) we define a crosscorrelation function. Thus
if f_a(t) and f_b(t) are two stationary random processes, their crosscorrelation
function is defined as

(4)    $\varphi_{ab}(\tau) = \lim_{T\to\infty} \dfrac{1}{2T} \int_{-T}^{T} f_a(t)\, f_b(t + \tau)\, dt$ .

It is important to note that the second subscript in φ_ab(τ) corresponds to the
random process that has been given the positive displacement τ. We can
show that

(5)    $\varphi_{ab}(\tau) = \varphi_{ba}(-\tau)$ .


Since we shall consider filtering and prediction by means of a linear system,
it is important at this point to state the relationship among the
input, output and the characterizing function of the system. It is well known
that the relationship in question is the convolution integral

(6)    $f_o(t) = \int_{-\infty}^{\infty} h(\tau)\, f_i(t - \tau)\, d\tau$ ,

where f_i(t) and f_o(t) are the input and output, respectively, of the linear
system, and h(t) is the time response of the system to a unit-impulse excitation.
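
The convolution integral has a direct discrete counterpart. The Python sketch below is ours: it replaces the integral by a Riemann sum with step dt, assumes a causal impulse response (h = 0 for negative time) and an input that vanishes before the record starts, and uses a hypothetical response h(t) = exp(−t) as the example system.

```python
import math


def convolve(h: list, f_in: list, dt: float) -> list:
    """Riemann-sum approximation of f_o(t) = integral of h(tau) f_i(t - tau) d tau.

    h[k] is the impulse response at time k*dt (zero for k < 0) and the input
    is taken to be zero before the first sample.
    """
    out = []
    for n in range(len(f_in)):
        acc = 0.0
        for k in range(min(n + 1, len(h))):
            acc += h[k] * f_in[n - k]
        out.append(acc * dt)
    return out


DT = 0.01
N = 500
h = [math.exp(-n * DT) for n in range(N)]      # hypothetical impulse response

# Unit step input: the output should approach 1 - exp(-t).
step_response = convolve(h, [1.0] * N, DT)

# Unit-area impulse input: the output should reproduce h itself.
impulse = [1.0 / DT] + [0.0] * (N - 1)
impulse_response = convolve(h, impulse, DT)

t_last = (N - 1) * DT
print(step_response[-1], 1.0 - math.exp(-t_last))
```

The step response differs from 1 − exp(−t) by a discretization error of the order of dt, which is the price of replacing the integral by a sum.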


2. — Formulation of the problem. 


We shall formulate the problem in fairly general terms so that filtering and
prediction are particular applications of the general theory. Let us consider
Fig. 2 where A represents a linear system which is characterized by h(t), and
f_i(t) is its input.

Choosing a particular situation, but not restricting the theory to it, we
consider the case where the input is the sum of a message f_m(t) and a noise
f_n(t). Thus f_i(t) = f_m(t) + f_n(t). We draw portions of the random processes
as shown. To indicate the present time, a peak in the message has been chosen
for convenience.

Now, if filtering is our objective, we will state that the desired output of
the system A, in the ideal situation, although it cannot be achieved perfectly,
is the original message without the noise as shown in the upper half of the
right-hand side of the figure. Frequently, in filtering, a lag in the output
message is not a distortion so that in specifying the desired output we write

(7)    $f_d(t) = f_m(t - \alpha)$ ,

where α > 0 is the lag. However, in the presence of the noise at the input,
the actual output cannot be without error no matter how we may design the
linear system. Thus we indicate together with the desired output the actual
output f_o(t) which we shall attempt to make as close as possible to f_d(t),
based upon a chosen criterion, by properly designing the system.

Fig. 2. (Labels: message f_m(t), noise f_n(t), linear system h(t), actual
output, desired output, filtering with lag, present time.)

When a lag in the output message is undesirable, but a lead in it is an
advantage, as in control problems, we then specify that the desired output
is the original input message with a forward displacement in time. This
specification combines filtering with prediction as we can readily see. The
desired output is therefore

(8)    $f_d(t) = f_m(t + \alpha)$ ,

where α > 0 is the prediction time. We have indicated this in the figure where
the past is to the left of the present, and the future is to the right of it.
The actual output f_o(t) is shown in a manner similar to the case of filtering
with lag.

As another example, consider pure prediction. By pure prediction we mean
the prediction of a message in the absence of a disturbance. For this problem
we have

(9)    $f_i(t) = f_m(t)$

and

(10)    $f_d(t) = f_m(t + \alpha)$ .

We have given only a few examples to illustrate the idea of a desired
output in a problem. This idea combines operations such as filtering and
prediction, which are conceived to be entirely unrelated in conventional
theory, into one single problem. At a later stage we shall discuss a further
generalization of the idea.

Concerning prediction we should remember that the correlation functions
upon which this theory is based are functions derived from statistical
descriptions of the messages and noise involved in the problem. The prediction
is therefore a statistical prediction by means of a linear operation.

Having specified the input and desired output of a linear system we are
ready to consider the performance of the linear system and the method of
finding that system which gives the best performance. The instantaneous
error is clearly the difference between the actual output and the desired
output. It is desirable that the measure of error be positive for any
instantaneous error and be mathematically manageable. Such a measure is the
mean-square error defined as


(11)    $\mathcal{E} = \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} [f_o(t) - f_d(t)]^2\,dt$ ,


which is simply the mean square of the instantaneous difference between the
actual output and the desired output.

We are now interested in the relationship between the mean-square error
and the correlation functions which characterize the input and the desired
output. To find the relationship, we write the convolution integral (6) for
$f_o(t)$ in (11) so that the mean-square error is brought into relation with the
system characteristic and the input. Thus


(12)    $\mathcal{E} = \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} \left[ \int_{-\infty}^{\infty} h(\tau) f_i(t-\tau)\,d\tau - f_d(t) \right]^2 dt$ .


Expanding the expression we have


(13)    $\mathcal{E} = \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} dt \left[ \int_{-\infty}^{\infty} h(\tau) f_i(t-\tau)\,d\tau \int_{-\infty}^{\infty} h(\sigma) f_i(t-\sigma)\,d\sigma - 2 f_d(t) \int_{-\infty}^{\infty} h(\tau) f_i(t-\tau)\,d\tau + f_d^2(t) \right]$ .


Inverting orders of integration we have


(14)    $\mathcal{E} = \int_{-\infty}^{\infty} h(\tau)\,d\tau \int_{-\infty}^{\infty} h(\sigma)\,d\sigma \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} f_i(t-\tau) f_i(t-\sigma)\,dt - 2 \int_{-\infty}^{\infty} h(\tau)\,d\tau \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} f_d(t) f_i(t-\tau)\,dt + \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} f_d^2(t)\,dt$ .


Since by (1)


(15)    $\varphi_{ii}(\tau-\sigma) = \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} f_i(t-\tau) f_i(t-\sigma)\,dt$


is the autocorrelation function of the input with the argument $(\tau - \sigma)$, and by
(4) and (5)


(16)    $\varphi_{id}(\tau) = \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} f_i(t) f_d(t+\tau)\,dt$


is the input-desired output crosscorrelation function, and


(17)    $\varphi_{dd}(0) = \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} f_d^2(t)\,dt$


is the mean-square value of the desired output expressed as the value of the
autocorrelation function of the desired output for $\tau = 0$, we write (14) in terms
of (15)-(17) and find that


(18)    $\mathcal{E} = \int_{-\infty}^{\infty} h(\tau)\,d\tau \int_{-\infty}^{\infty} h(\sigma)\,d\sigma\,\varphi_{ii}(\tau-\sigma) - 2 \int_{-\infty}^{\infty} h(\tau)\,d\tau\,\varphi_{id}(\tau) + \varphi_{dd}(0)$ .


This is the relationship we wanted to establish. Obviously, the mean-square
error is dependent upon the system characteristic $h(t)$, but it is interesting
and important to note that it is also dependent upon the autocorrelation
function of the input $\varphi_{ii}(\tau)$ and the input-desired output crosscorrelation of
the system $\varphi_{id}(\tau)$, as well as the mean-square value of the desired output
$\varphi_{dd}(0)$. Let us remember that the mean-square error (18) holds for any given
$h(t)$, not necessarily the one that minimizes the mean-square error, any given
$\varphi_{ii}(\tau)$ and any given $\varphi_{id}(\tau)$. In other words, for any given system, characterized
by $h(t)$, with an input $f_i(t)$ which has an autocorrelation function $\varphi_{ii}(\tau)$, and
a desired output $f_d(t)$ which has the crosscorrelation function $\varphi_{id}(\tau)$ with $f_i(t)$,
the mean-square error as defined by (11) is expressed by (18).
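
As a numerical illustration of the relation between (11) and (18), the following sketch (Python with NumPy, which is of course not part of the original lecture) evaluates a discrete-time analogue of the mean-square error in two ways: directly as a time average, and from the correlation functions alone. The periodic signals, the sample sizes, the arbitrary filter `h` and all function names are assumptions made for the example. Circular shifts are used so that both values agree to rounding error instead of only in the limit of long averaging.

```python
import numpy as np

def circ_corr(a, b):
    """Sample correlation phi_ab[m] = (1/N) * sum_n a[n] * b[(n + m) mod N]."""
    n = len(a)
    return np.array([np.dot(a, np.roll(b, -m)) for m in range(n)]) / n

def mse_direct(h, f_i, f_d):
    """Analogue of (11): time average of (f_o - f_d)^2, with f_o = h * f_i (circular)."""
    f_o = sum(h[k] * np.roll(f_i, k) for k in range(len(h)))
    return float(np.mean((f_o - f_d) ** 2))

def mse_from_correlations(h, f_i, f_d):
    """Analogue of (18): the same error from phi_ii, phi_id and phi_dd(0) only."""
    n = len(f_i)
    phi_ii = circ_corr(f_i, f_i)
    phi_id = circ_corr(f_i, f_d)
    phi_dd0 = float(np.mean(f_d ** 2))
    k = np.arange(len(h))
    big_phi = phi_ii[(k[:, None] - k[None, :]) % n]   # phi_ii(tau - sigma)
    return float(h @ big_phi @ h - 2.0 * h @ phi_id[: len(h)] + phi_dd0)

rng = np.random.default_rng(0)
f_m = np.convolve(rng.standard_normal(512), np.ones(8) / 8, mode="same")  # message
f_i = f_m + 0.3 * rng.standard_normal(512)   # input: message plus noise
f_d = np.roll(f_m, 2)                        # desired output: message lagged by 2 samples
h = rng.standard_normal(6)                   # arbitrary system, not the optimum one

e_direct = mse_direct(h, f_i, f_d)
e_corr = mse_from_correlations(h, f_i, f_d)
print(e_direct, e_corr)
```

The two printed numbers should coincide, which reflects the remark above that (18) holds for any $h(t)$ and not only for the optimum one.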

The next step in the problem of filtering and prediction is the determination
of a condition under which the mean-square error (18) is the minimum. From
the observations we have just made it is clear that the condition will involve
$\varphi_{ii}(\tau)$ and $\varphi_{id}(\tau)$, the two functions which specify the problem.


3. — Minimization of mean-square error.


Since in filtering and prediction $f_i(t)$ and $f_d(t)$ are specified for any
particular problem, so that $\varphi_{ii}(\tau)$ and $\varphi_{id}(\tau)$ are completely determined at the
outset, the only change that can be made to reduce the mean-square error
is a change in the system characteristic $h(t)$. Finding a condition
relating $h(t)$, $\varphi_{ii}(\tau)$ and $\varphi_{id}(\tau)$ for minimum mean-square error is a problem in
the calculus of variations. Accordingly we let $\epsilon$ be a parameter which is
independent of $h(t)$, and $\eta(t)$ be the variation of $h(t)$ with the condition that


(19)    $\eta(t) = 0$ for $t < 0$ .


Since $h(t)$ is the system time response to a unit-impulse excitation applied at
$t = 0$, and the system is initially at rest, it is necessary that


(20)    $h(t) = 0$ for $t < 0$ .


It follows immediately that if $\eta(t)$ is a possible variation of $h(t)$ it must satisfy
the condition (19). When $h(t)$ undergoes the variation $\epsilon\eta(t)$ we let the
corresponding variation in $\mathcal{E}$ be $\delta\mathcal{E}$. It is known that a necessary condition for
minimum mean-square error, $\mathcal{E}_{\min}$, is that


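(21)    $\left. \frac{\partial}{\partial\epsilon} (\mathcal{E} + \delta\mathcal{E}) \right|_{\epsilon=0} = 0$ for all possible $\eta$ .

Introducing the variations in (18) we have


(22)    $\mathcal{E} + \delta\mathcal{E} = \int_{-\infty}^{\infty} [h(\tau) + \epsilon\eta(\tau)]\,d\tau \int_{-\infty}^{\infty} [h(\sigma) + \epsilon\eta(\sigma)]\,d\sigma\,\varphi_{ii}(\tau-\sigma) - 2 \int_{-\infty}^{\infty} [h(\tau) + \epsilon\eta(\tau)]\,d\tau\,\varphi_{id}(\tau) + \varphi_{dd}(0)$ ,


which can be simplified to


(23)    $\mathcal{E} + \delta\mathcal{E} = \mathcal{E} + 2\epsilon \int_{-\infty}^{\infty} \eta(\tau)\,d\tau \int_{-\infty}^{\infty} h(\sigma)\,d\sigma\,\varphi_{ii}(\tau-\sigma) + \epsilon^2 \int_{-\infty}^{\infty} \eta(\tau)\,d\tau \int_{-\infty}^{\infty} \eta(\sigma)\,d\sigma\,\varphi_{ii}(\tau-\sigma) - 2\epsilon \int_{-\infty}^{\infty} \eta(\tau)\,d\tau\,\varphi_{id}(\tau)$ .


Applying condition (21) to (23) we find that for $\mathcal{E}_{\min}$ it is necessary that


(24)    $\int_{-\infty}^{\infty} \eta(\tau)\,d\tau \int_{-\infty}^{\infty} h(\sigma)\,d\sigma\,\varphi_{ii}(\tau-\sigma) - \int_{-\infty}^{\infty} \eta(\tau)\,d\tau\,\varphi_{id}(\tau) = 0$ for all possible $\eta$ .

Rewriting (24) we have


(25)    $\int_{-\infty}^{\infty} \eta(\tau) \left[ \int_{-\infty}^{\infty} h(\sigma) \varphi_{ii}(\tau-\sigma)\,d\sigma - \varphi_{id}(\tau) \right] d\tau = 0$ for all possible $\eta$ .


Since the range for $\tau$ is $(-\infty, \infty)$ we shall consider (25) for $(-\infty < \tau < 0)$
and then for $(0 \le \tau < \infty)$. According to (19), $\eta(\tau) = 0$ for $(-\infty < \tau < 0)$.
Hence (25) holds although


(26)    $\int_{-\infty}^{\infty} h(\sigma) \varphi_{ii}(\tau-\sigma)\,d\sigma - \varphi_{id}(\tau)$ ,


is not necessarily zero, and it generally does not vanish for $(-\infty < \tau < 0)$.
On the other hand, for $(0 \le \tau < \infty)$, the expression (26) in (25) must vanish
because if it does not, an $\eta(\tau)$ can be found such that the condition (25) is
violated. The conclusion is that a necessary condition for $\mathcal{E}_{\min}$ is that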


(27)    $\int_{-\infty}^{\infty} h_{opt}(\sigma) \varphi_{ii}(\tau-\sigma)\,d\sigma - \varphi_{id}(\tau) = 0$ for $\tau \ge 0$ .


Here we let $h_{opt}(t)$ denote the time response of the optimum linear system to
a unit-impulse excitation. We call a system an optimum system when the
mean-square error is the minimum. Equation (27) is of the Wiener-Hopf
type, and we shall call it the Wiener-Hopf equation.

Since (21) is also a condition for maximum mean-square error, we should
establish the fact that (27) is for minimum, not maximum, mean-square error.
This fact can be shown quite readily, but for a short presentation of the subject
the details will be omitted.
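
A discrete-time sketch may help to make the condition (27) concrete. The Python code below (NumPy assumed) restricts the system to a causal response of `m` samples, which is a simplification not made in the text, and solves the resulting linear equations. It then checks that the residual corresponding to (26) vanishes on the non-negative shifts covered by the filter but not on negative shifts, and that a perturbed response gives a larger error. The signals, the lag of 2 samples, the filter length and all names are assumptions chosen for illustration.

```python
import numpy as np

def circ_corr(a, b):
    """phi_ab[m] = (1/N) * sum_n a[n] * b[(n + m) mod N]."""
    n = len(a)
    return np.array([np.dot(a, np.roll(b, -m)) for m in range(n)]) / n

def solve_wiener_hopf(phi_ii, phi_id, m):
    """Solve sum_l h[l] phi_ii[k - l] = phi_id[k] for k = 0..m-1 (causal h of length m)."""
    n = len(phi_ii)
    k = np.arange(m)
    big_phi = phi_ii[(k[:, None] - k[None, :]) % n]
    return np.linalg.solve(big_phi, phi_id[:m])

def residual_q(h, phi_ii, phi_id):
    """q[tau] = sum_l h[l] phi_ii[tau - l] - phi_id[tau] for every circular shift tau."""
    conv = sum(h[l] * np.roll(phi_ii, l) for l in range(len(h)))
    return conv - phi_id

def mse(h, f_i, f_d):
    """Time-averaged squared error of the (circular) filter output."""
    f_o = sum(h[k] * np.roll(f_i, k) for k in range(len(h)))
    return float(np.mean((f_o - f_d) ** 2))

rng = np.random.default_rng(1)
f_m = np.convolve(rng.standard_normal(512), np.ones(8) / 8, mode="same")
f_i = f_m + 0.3 * rng.standard_normal(512)
f_d = np.roll(f_m, 2)                    # lag filter: desired output is the delayed message
m = 12

phi_ii = circ_corr(f_i, f_i)
phi_id = circ_corr(f_i, f_d)
h_opt = solve_wiener_hopf(phi_ii, phi_id, m)
q = residual_q(h_opt, phi_ii, phi_id)

worst_nonneg = float(np.max(np.abs(q[:m])))     # shifts 0 .. m-1
worst_negative = float(np.max(np.abs(q[-5:])))  # shifts -5 .. -1
e_opt = mse(h_opt, f_i, f_d)
e_perturbed = mse(h_opt + 0.05 * rng.standard_normal(m), f_i, f_d)
print(worst_nonneg, worst_negative, e_opt, e_perturbed)
```

Because the response is truncated to `m` samples, the residual is forced to zero only for shifts 0 to `m - 1`; in the continuous theory the condition applies to the whole half line $\tau \ge 0$.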


4. — Solution of the Wiener-Hopf equation. 


Before the discussion on the solution of the Wiener-Hopf equation which 
will yield the optimum linear system characteristic in terms of the transforms 
of the input autocorrelation function and the input-desired output crosscor- 
relation function, we should consider some background material. 


4.1. Transforms of correlation functions. — An extremely important theorem
in the theory of filtering and prediction is the Wiener theorem for
autocorrelation. It states that for a stationary random process $f_a(t)$ whose
autocorrelation function is


(28)    $\varphi_{aa}(\tau) = \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} f_a(t) f_a(t+\tau)\,dt$


the power density spectrum $\Phi_{aa}(\omega)$ of $f_a(t)$ and its autocorrelation function
are Fourier transforms of each other, that is,


(29)    $\varphi_{aa}(\tau) = \int_{-\infty}^{\infty} \Phi_{aa}(\omega) \exp[j\omega\tau]\,d\omega$ ,
and
(30)    $\Phi_{aa}(\omega) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \varphi_{aa}(\tau) \exp[-j\omega\tau]\,d\tau$ ,


where $\omega$ is the angular frequency.

For the physical meaning of the power density spectrum let us consider
the wave of Fig. 1 whose autocorrelation function is given by (3). In
accordance with the Wiener theorem, the power density spectrum of the wave is


(31)    $\Phi_{aa}(\omega) = \frac{1}{2\pi} \int_{-\infty}^{\infty} E^2 \exp[-2k|\tau|] \exp[-j\omega\tau]\,d\tau = \frac{E^2}{\pi}\,\frac{2k}{(2k)^2 + \omega^2}$ .

A sketch of this function is given in Fig. 3. If we consider the wave in Fig. 1
as a voltage or a current associated with a one-ohm load of pure resistance,
then the sum of the shaded areas under the curve of $\Phi_{aa}(\omega)$ over the bands
$(\omega_1, \omega_2)$ and $(-\omega_1, -\omega_2)$ represents the power supplied to the load by
components of $f_a(t)$ of all frequencies in the band $(\omega_1, \omega_2)$. The total area under
the curve represents the total power supplied to the load since


(32)    $\int_{-\infty}^{\infty} \Phi_{aa}(\omega)\,d\omega = \varphi_{aa}(0) = \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} f_a^2(t)\,dt$ .


Fig. 3. Sketch of the power density spectrum $\Phi_{aa}(\omega)$.
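
The transform pair (30)-(31) and the power relation (32) can be checked numerically. The sketch below (Python with NumPy) assumes the autocorrelation $E^2 \exp[-2k|\tau|]$ that appears under the integral in (31); the values of `E` and `k`, the integration ranges and the step sizes are illustrative choices, and a plain midpoint sum stands in for the integrals.

```python
import numpy as np

E, k = 1.5, 0.7   # illustrative amplitude and rate, not taken from the text

def phi_aa(tau):
    """Autocorrelation used in (31): E^2 * exp(-2k|tau|)."""
    return E**2 * np.exp(-2.0 * k * np.abs(tau))

def spectrum_closed_form(w):
    """Right-hand side of (31)."""
    return (E**2 / np.pi) * 2.0 * k / ((2.0 * k) ** 2 + w**2)

def spectrum_numeric(w, t_max=40.0, dt=1e-3):
    """(30) by a midpoint sum; the sine part vanishes because phi_aa is even."""
    tau = np.arange(-t_max, t_max, dt) + dt / 2
    return float(np.sum(phi_aa(tau) * np.cos(w * tau)) * dt / (2.0 * np.pi))

def total_power(w_max=2000.0, dw=0.01):
    """Left-hand side of (32), truncated to |w| < w_max."""
    w = np.arange(-w_max, w_max, dw) + dw / 2
    return float(np.sum(spectrum_closed_form(w)) * dw)

checks = [(w, spectrum_numeric(w), float(spectrum_closed_form(w))) for w in (0.0, 1.0, 3.0)]
power = total_power()
print(checks)
print(power, phi_aa(0.0))
```

The truncated frequency integral falls slightly short of $\varphi_{aa}(0) = E^2$ because the spectrum decays only as $1/\omega^2$.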


Analogous to the Wiener theorem we can relate the crosscorrelation
function $\varphi_{ab}(\tau)$, for $f_a(t)$ and $f_b(t)$, defined as


(33)    $\varphi_{ab}(\tau) = \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} f_a(t) f_b(t+\tau)\,dt$ ,


to its Fourier transform $\Phi_{ab}(\omega)$ by the expressions


(34)    $\varphi_{ab}(\tau) = \int_{-\infty}^{\infty} \Phi_{ab}(\omega) \exp[j\omega\tau]\,d\omega$ ,
and
(35)    $\Phi_{ab}(\omega) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \varphi_{ab}(\tau) \exp[-j\omega\tau]\,d\tau$ .


We shall call $\Phi_{ab}(\omega)$ the cross-power density spectrum of $f_a(t)$ and $f_b(t)$.
Although it is possible to consider a particular situation where $f_a(t)$, $f_b(t)$ and
$\Phi_{ab}(\omega)$ correspond to physical quantities, we find it unnecessary to look for
a physical interpretation of $\Phi_{ab}(\omega)$ in every problem. Unless it has a
meaningful and direct physical interpretation in a problem we shall consider the
cross-power density as a mathematical quantity.


4.2. The input-output crosscorrelation theorem for a linear system. — For a
linear system with the unit-impulse response $h(t)$, an input $f_i(t)$ which is a
stationary random process, and the corresponding output $f_o(t)$, an important
relationship exists among $h(t)$, the input autocorrelation $\varphi_{ii}(\tau)$ and the
input-output crosscorrelation $\varphi_{io}(\tau)$. To derive this relationship let us first write
the expression for the input-output crosscorrelation of the linear system
according to definition. Thus


(36)    $\varphi_{io}(\tau) = \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} f_i(t) f_o(t+\tau)\,dt$ .


To bring the system characteristic into this expression we introduce the
convolution integral (6). In so doing we have


(37)    $\varphi_{io}(\tau) = \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} f_i(t)\,dt \int_{-\infty}^{\infty} h(\sigma) f_i(t+\tau-\sigma)\,d\sigma$ .


By inversion of the order of integration, (37) becomes


(38)    $\varphi_{io}(\tau) = \int_{-\infty}^{\infty} h(\sigma)\,d\sigma \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} f_i(t) f_i(t+\tau-\sigma)\,dt$ .

Since

(39)    $\varphi_{ii}(\tau-\sigma) = \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} f_i(t) f_i(t+\tau-\sigma)\,dt$


it follows that (38) is


(40)    $\varphi_{io}(\tau) = \int_{-\infty}^{\infty} h(\sigma) \varphi_{ii}(\tau-\sigma)\,d\sigma$ .


We have therefore shown the following theorem: The input-output
crosscorrelation of a linear system is the convolution of the system response to
unit-impulse excitation and the input autocorrelation.

This theorem plays an important role in the statistical theory of
communication.
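
The theorem (40) has an exact discrete counterpart for periodic sequences, which the following Python sketch (NumPy assumed) verifies. The random input, the four-sample response `h` and the helper names are assumptions for the example; shifts are taken circularly so that the identity holds to rounding error.

```python
import numpy as np

def circ_corr(a, b):
    """phi_ab[m] = (1/N) * sum_n a[n] * b[(n + m) mod N]."""
    n = len(a)
    return np.array([np.dot(a, np.roll(b, -m)) for m in range(n)]) / n

def circ_filter(h, x):
    """y[n] = sum_k h[k] * x[(n - k) mod N], the circular analogue of (6)."""
    return sum(h[k] * np.roll(x, k) for k in range(len(h)))

rng = np.random.default_rng(2)
f_i = rng.standard_normal(256)            # input
h = np.array([0.5, 0.3, -0.2, 0.1])       # unit-impulse response (illustrative)
f_o = circ_filter(h, f_i)                 # output

phi_io_measured = circ_corr(f_i, f_o)                      # definition (36)
phi_io_theorem = circ_filter(h, circ_corr(f_i, f_i))       # right-hand side of (40)
max_difference = float(np.max(np.abs(phi_io_measured - phi_io_theorem)))
print(max_difference)
```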


4.3. Relation between the unit-impulse response and the system function. —
In the frequency domain a linear system may be characterized by its function
$H(\omega)$ which is the ratio of the output to the input, as a function of the angular
frequency $\omega$, when the input is a steady sinusoidal voltage or current. As
we know, the input and output are expressed in the complex form so that
$H(\omega)$ is a complex expression which contains the amplitude and phase spectrums
of the system. We shall state here for reference the well known relations
between the linear system unit-impulse response $h(t)$ and the system function.
These relations are:


(41)    $h(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} H(\omega) \exp[j\omega t]\,d\omega$ ,


(42)    $H(\omega) = \int_{-\infty}^{\infty} h(t) \exp[-j\omega t]\,dt$ .


Now, having introduced the necessary background material in Sect. 4.1
to 4.3, we return to the Wiener-Hopf equation (27) for discussion and solution.
Noting that there is great similarity in appearance between (27) and (40) we
shall investigate the significance of the Wiener-Hopf equation particularly with
respect to the restriction $\tau \ge 0$.

Let us first consider the term


(43)    $\int_{-\infty}^{\infty} h_{opt}(\sigma) \varphi_{ii}(\tau-\sigma)\,d\sigma$ ,


in the Wiener-Hopf equation.
A sketch of the functions involved, just for the purpose of explanation
without any reference to a specific problem and disregarding precision in
drawing, is given in Fig. 4(a). The convolution of $h_{opt}(\sigma)$ and $\varphi_{ii}(\sigma)$ results
in the curve of Fig. 4(b). In the light of the input-output crosscorrelation
theorem (40) the convolution (43) for all values of $\tau$ must be equal to the
input-output crosscorrelation of the optimum system. Now, if we refer to (27) and
consider the whole statement we find that it states that the optimum linear system
must be such that its input-output crosscorrelation is equal to the input-desired
output crosscorrelation for $\tau \ge 0$. It must be emphasized that for $\tau < 0$ these
crosscorrelations need not be equal, and are generally not equal. The reason
for this is that the desired output is generally not possible to obtain without
error under the conditions of the problem. If the desired output were possible
to obtain without error then (27) would hold for all values of $\tau$ and the solution of
the problem would be trivial. The difference between (43) and $\varphi_{id}(\tau)$, which is sketched
in Fig. 4(c), in accordance with the Wiener-Hopf equation is therefore zero for
$\tau \ge 0$, and is generally not zero for $\tau < 0$ as illustrated in Fig. 4(d). Let us put


(44)    $q(\tau) = \int_{-\infty}^{\infty} h_{opt}(\sigma) \varphi_{ii}(\tau-\sigma)\,d\sigma - \varphi_{id}(\tau)$ .


Fig. 4. (a) The functions $h_{opt}(\sigma)$ and $\varphi_{ii}(\sigma)$; (b) their convolution, the input-output crosscorrelation of the optimum system; (c) $\varphi_{id}(\tau)$; (d) the difference $q(\tau)$.


This consideration of the Wiener-Hopf equation shows that to obtain minimum
mean-square error the linear system must be so designed that its input-output
crosscorrelation equals the input-desired output crosscorrelation for $\tau \ge 0$. For all
other values of $\tau$, that is, for $\tau < 0$, there is no restriction.

The solution of the Wiener-Hopf equation starts from the fact that $q(\tau)$ as given
by (44) is a function that vanishes for $\tau \ge 0$. For this reason let us note that if
a function $f(t) \to 0$ as $t \to \infty$ and $f(t) = 0$ for $t < 0$, as illustrated in Fig. 5(a),
then its Fourier transform $F(\lambda)$ as a function of the complex
variable $\lambda = \omega + j\sigma$ has no pole in the lower half-plane (lhp) as shown in
Fig. 5(b). The poles of $F(\lambda)$ are in the upper half-plane (uhp). We have


(45)    $F(\lambda) = \frac{1}{2\pi} \int_{-\infty}^{\infty} f(t) \exp[-j\lambda t]\,dt$ ,
and
(46)    $f(t) = \int_{-\infty}^{\infty} F(\lambda) \exp[j\lambda t]\,d\lambda$ .


Fig. 5. (a) The function $f(t)$; (b) location of the poles of $F(\lambda)$.


As an example, consider


(47)    $f(t) = A \exp[-at]$ for $t \ge 0$ , $f(t) = 0$ for $t < 0$ ,

then

(48)    $F(\lambda) = \frac{1}{2\pi} \int_{0}^{\infty} A \exp[-at] \exp[-j\lambda t]\,dt = \frac{A}{2\pi}\,\frac{1}{a + j\lambda}$ ,


and the pole of $F(\lambda)$ is found from


(49)    $a + j\lambda = 0$ ,
or
(50)    $a + j(\omega + j\sigma) = 0$ ,

which gives the location of the pole in the uhp as


(51)    $\omega = 0$, $j\sigma = ja$ .


On the other hand if $g(t) \to 0$ as $t \to -\infty$ and $g(t) = 0$ for $t > 0$, as illustrated in
Fig. 6(a), then its Fourier transform $G(\lambda)$ as a function of $\lambda$ has no poles in the uhp.
The poles of $G(\lambda)$ are in the lhp, as shown in Fig. 6(b). The relations between
$g(t)$ and $G(\lambda)$ are


(52)    $G(\lambda) = \frac{1}{2\pi} \int_{-\infty}^{\infty} g(t) \exp[-j\lambda t]\,dt$ ,
and
(53)    $g(t) = \int_{-\infty}^{\infty} G(\lambda) \exp[j\lambda t]\,d\lambda$ .


Fig. 6. (a) The function $g(t)$; (b) location of the poles of $G(\lambda)$.
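Returning to $q(\tau)$ in (44) we see that it behaves as $g(t)$ so that its Fourier
transform has no poles in the uhp. Let


(54)    $Q(\lambda) = \frac{1}{2\pi} \int_{-\infty}^{\infty} q(\tau) \exp[-j\lambda\tau]\,d\tau$ .


Substituting (44) we have


(55)    $Q(\lambda) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \exp[-j\lambda\tau]\,d\tau \left[ \int_{-\infty}^{\infty} h_{opt}(\sigma) \varphi_{ii}(\tau-\sigma)\,d\sigma - \varphi_{id}(\tau) \right] = H_{opt}(\lambda) \Phi_{ii}(\lambda) - \Phi_{id}(\lambda)$ ,


in which $H_{opt}(\lambda)$ is the system function of the optimum system, as a function
of $\lambda$, which is related to $h_{opt}(t)$ by (42). The input power density spectrum
$\Phi_{ii}(\lambda)$ and the input-desired output cross-power density spectrum $\Phi_{id}(\lambda)$ are
related to $\varphi_{ii}(\tau)$ and $\varphi_{id}(\tau)$ by (30) and (35) respectively. Because of the fact
that $q(\tau)$ vanishes for $\tau \ge 0$, we have the result that $Q(\lambda)$, or


(56)    $[H_{opt}(\lambda) \Phi_{ii}(\lambda) - \Phi_{id}(\lambda)]$ has no poles in the uhp.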


At this point it is necessary to introduce the idea of spectrum factorization.
Suppose the input is the sum of a message and a noise,


(57)    $f_i(t) = f_m(t) + f_n(t)$

and the input autocorrelation is

(58)    $\varphi_{ii}(\tau) = \varphi_{mm}(\tau) + \varphi_{mn}(\tau) + \varphi_{nn}(\tau) + \varphi_{nm}(\tau)$ .

After transformation in accordance with (30) and (35) we have

(59)    $\Phi_{ii}(\omega) = \Phi_{mm}(\omega) + \Phi_{nn}(\omega) + \Phi_{mn}(\omega) + \Phi_{nm}(\omega)$ .

In solving the Wiener-Hopf equation by the method of spectrum factorization
we need only factorize the input density spectrum. To show the idea let us
assume that the message $f_m(t)$ is a Poisson rectangular wave whose power
density spectrum is given by (31). Furthermore, for simplicity, we put


(60)    $\Phi_{mm}(\omega) = \frac{1}{1 + \omega^2}$ .

If the noise $f_n(t)$ is a white noise, then

(61)    $\Phi_{nn}(\omega) = a^2$ ,


which is a constant. We further assume that $\Phi_{mn}(\omega) = 0$. The input power
density spectrum (59), for the case of (57), becomes


(62)    $\Phi_{ii}(\omega) = \frac{1}{1 + \omega^2} + a^2 = \frac{1 + a^2 + a^2\omega^2}{1 + \omega^2}$ .


Writing this expression in $\lambda$ and factorizing we obtain


(63)    $\Phi_{ii}(\lambda) = \frac{a(b + j\lambda)}{1 + j\lambda} \cdot \frac{a(b - j\lambda)}{1 - j\lambda}$ ,


where $b = \sqrt{1 + a^2}/a$. If we put

(64)    $\Phi_{ii}^{+}(\lambda) = \frac{a(b + j\lambda)}{1 + j\lambda}$ ,

we shall find that $\Phi_{ii}^{+}(\lambda)$ has a pole in the uhp at $(\omega = 0, j\sigma = j1)$ and a zero in the
uhp at $(\omega = 0, j\sigma = jb)$ as shown in Fig. 7(a).
Similarly we put


(65)    $\Phi_{ii}^{-}(\lambda) = \frac{a(b - j\lambda)}{1 - j\lambda}$ ,


which has a pole in the lhp at $(\omega = 0, j\sigma = -j1)$ and a zero in the lhp at
$(\omega = 0, j\sigma = -jb)$ as shown in Fig. 7(b).


Fig. 7. (a) Pole and zero of $\Phi_{ii}^{+}(\lambda)$; (b) pole and zero of $\Phi_{ii}^{-}(\lambda)$.


In the factorization of the input power density spectrum we assume that
it is possible to express the spectrum in the form


(66)    $\Phi_{ii}(\lambda) = \Phi_{ii}^{+}(\lambda)\,\Phi_{ii}^{-}(\lambda)$ ,


in which $\Phi_{ii}^{+}(\lambda)$ contains all the poles and zeros of $\Phi_{ii}(\lambda)$ that are in the uhp
and $\Phi_{ii}^{-}(\lambda)$ contains all the poles and zeros of $\Phi_{ii}(\lambda)$ that are in the lhp.
Furthermore


(67)    $\Phi_{ii}^{+}(\lambda) = \overline{\Phi_{ii}^{-}(\lambda)}$ ,


where the bar over $\Phi_{ii}^{-}(\lambda)$ denotes the conjugate of the function. It is
important to note that $\Phi_{ii}(\omega)$ is an even function, and that if it is expressed as
a rational function we shall be able to factorize it in accordance with (66)
and (67).

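We now return to the consideration of (56). If we multiply this expression
by $1/\Phi_{ii}^{-}(\lambda)$ we shall find that the resulting expression still has no poles in
the uhp. To see this let us note that $\Phi_{ii}^{-}(\lambda)$ has no zeros in the uhp so that
its reciprocal cannot have poles in that half-plane. The result of the multiplication
is that


(68)    $H_{opt}(\lambda) \Phi_{ii}^{+}(\lambda) - \frac{\Phi_{id}(\lambda)}{\Phi_{ii}^{-}(\lambda)}$ has no poles in the uhp.


For simplicity in the writing of the next few steps we put


(69)    $\Psi(\lambda) = \frac{\Phi_{id}(\lambda)}{\Phi_{ii}^{-}(\lambda)}$ ,

(70)    $\psi(t) = \int_{-\infty}^{\infty} \Psi(\lambda) \exp[j\lambda t]\,d\lambda$ ,
and
(71)    $\Psi(\lambda) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \psi(t) \exp[-j\lambda t]\,dt$ .


Fig. 8. An example of $\psi(t)$.


Since $\Phi_{id}(\lambda)/\Phi_{ii}^{-}(\lambda)$ in general has poles in both half-planes, its transform $\psi(t)$
in general does not vanish over a half line. An example of $\psi(t)$ is shown in
Fig. 8. We shall write the right-hand side of (71) as the sum of two integrals,
one over the interval $(-\infty, 0)$ and the other over the interval $(0, \infty)$. Thus


(72)    $\Psi(\lambda) = \frac{1}{2\pi} \int_{-\infty}^{0} \psi(t) \exp[-j\lambda t]\,dt + \frac{1}{2\pi} \int_{0}^{\infty} \psi(t) \exp[-j\lambda t]\,dt$ .


Substituting (72) in (68) we have


(73)    $H_{opt}(\lambda) \Phi_{ii}^{+}(\lambda) - \frac{1}{2\pi} \int_{0}^{\infty} \psi(t) \exp[-j\lambda t]\,dt - \frac{1}{2\pi} \int_{-\infty}^{0} \psi(t) \exp[-j\lambda t]\,dt$ has no poles in the uhp.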


By considering the location of the poles in the expression as a whole, and in
the components, we shall finally obtain the solution of the Wiener-Hopf
equation. First, the optimum system function $H_{opt}(\lambda)$ must have no poles in the
lhp because its transform, the response to unit-impulse excitation, behaves
as $f(t)$ in Fig. 5(a). Next, the function $\Phi_{ii}^{+}(\lambda)$ has no poles in the lhp by
definition. Hence


(74)    $H_{opt}(\lambda) \Phi_{ii}^{+}(\lambda)$ has no poles in the lhp.

The term

(75)    $\frac{1}{2\pi} \int_{0}^{\infty} \psi(t) \exp[-j\lambda t]\,dt$ has no poles in the lhp,


because it is the transform of a function that behaves as $f(t)$ in Fig. 5(a).
Therefore we see that the first two terms of (73)


(76)    $H_{opt}(\lambda) \Phi_{ii}^{+}(\lambda) - \frac{1}{2\pi} \int_{0}^{\infty} \psi(t) \exp[-j\lambda t]\,dt$ has no poles in the lhp.


The property of (73), that it has no poles in the uhp, must now be utilized.
In (73) the term


$\frac{1}{2\pi} \int_{-\infty}^{0} \psi(t) \exp[-j\lambda t]\,dt$ has no poles in the uhp,


because this is the transform of a function that behaves as $g(t)$ in Fig. 6(a).
The location of poles of this term is in agreement with the overall property
of (73). To satisfy this overall property it is therefore necessary that the
remainder of the expression


(77)    $H_{opt}(\lambda) \Phi_{ii}^{+}(\lambda) - \frac{1}{2\pi} \int_{0}^{\infty} \psi(t) \exp[-j\lambda t]\,dt$ has no poles in the uhp.


Now, since this expression has been found to have no poles in the lhp also,
as stated in (76), it is an expression that has no poles in the whole plane. We
conclude that


(78)    $H_{opt}(\lambda) \Phi_{ii}^{+}(\lambda) - \frac{1}{2\pi} \int_{0}^{\infty} \psi(t) \exp[-j\lambda t]\,dt = \text{const.}$


We can show that the constant should be zero by further analysis, but we
shall omit the details here. We have, finally, from (78)


(79)    $H_{opt}(\lambda) = \frac{1}{2\pi \Phi_{ii}^{+}(\lambda)} \int_{0}^{\infty} \psi(t) \exp[-j\lambda t]\,dt$ ,
where
(80)    $\psi(t) = \int_{-\infty}^{\infty} \frac{\Phi_{id}(\lambda)}{\Phi_{ii}^{-}(\lambda)} \exp[j\lambda t]\,d\lambda$ .


This is the solution of the Wiener-Hopf equation (27). It gives the optimum
system function explicitly in terms of the input power density spectrum in
its factorized form, and the input-desired output cross-power density spectrum.
We must emphasize the fact that the solution is in a general form without
restriction on the manner in which messages and noise are combined, and
without restriction on the form of the desired output as long as it has a
non-zero correlation with the input. Obviously if the desired output and the input
have a zero correlation the problem is a trivial one. A common form of the
input is the sum of a message and a noise, but the theory is not restricted to
this form of input. However, we must bear in mind that in some problems
the optimum system may yield a very poor result because the theory is
restricted to the linear system. A case in point is that in which the input is
the product of a message and a noise, and the desired output is the message.
Let us again observe that given a stationary random input with a power
density spectrum $\Phi_{ii}(\omega)$ that can be factorized in accordance with (66), and a
stationary random desired output that has a cross-power density spectrum
$\Phi_{id}(\omega)$ with the input, then (79) and (80) specify the optimum linear system


in terms of these spectrums. This is a very general and important result which
applies to a large class of problems.


5. — Illustrative examples of filtering and prediction.


5.1. Filtering. — Referring to the discussion under Sect. 2, if the input
in a filter is


(81)    $f_i(t) = f_m(t) + f_n(t)$

then for a lag filter we specify that

(82)    $f_d(t) = f_m(t - \alpha)$


in which $\alpha > 0$ is the lag time, and for a lead (prediction) filter we specify
that


(83)    $f_d(t) = f_m(t + \alpha)$


in which $\alpha > 0$ is the prediction time.
In applying (79) and (80) to the present problem we first determine $\varphi_{id}(\tau)$.
Accordingly we have


(84)    $\varphi_{id}(\tau) = \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} f_i(t) f_d(t+\tau)\,dt = \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} f_i(t) f_m(t+\tau \pm \alpha)\,dt = \varphi_{im}(\tau \pm \alpha)$ .


From this crosscorrelation function we obtain $\Phi_{id}(\lambda)$ by transformation, thus


(85)    $\Phi_{id}(\lambda) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \varphi_{im}(\tau \pm \alpha) \exp[-j\lambda\tau]\,d\tau = \exp[\pm j\alpha\lambda]\,\Phi_{im}(\lambda)$ .


Substituting this result in (80) and retaining the general form (79) we obtain
the following formulas for the lag filter or the lead filter:


(86)    $H_{opt}(\lambda) = \frac{1}{2\pi \Phi_{ii}^{+}(\lambda)} \int_{0}^{\infty} \psi(t) \exp[-j\lambda t]\,dt$ ,

(87)    $\psi(t) = \int_{-\infty}^{\infty} \frac{\Phi_{im}(\lambda)}{\Phi_{ii}^{-}(\lambda)} \exp[j\lambda(t \pm \alpha)]\,d\lambda$ ,


where $-\alpha$ is for the lag filter and $+\alpha$ is for the lead filter. In (86) $\Phi_{ii}(\lambda)$
is given by (59) and in (87)


(88)    $\Phi_{im}(\lambda) = \Phi_{mm}(\lambda) + \Phi_{nm}(\lambda)$ .


For a specific example, let us assume that $\Phi_{mm}(\omega) = 1/(1+\omega^2)$, $\Phi_{nn}(\omega) = a^2$,
and $\Phi_{mn}(\omega) = 0$. This example has been chosen to illustrate spectrum
factorization (see (60) to (65)). We shall determine the system function of
the optimum lag filter. Substituting (65) in (87) and noting that $\Phi_{im}(\lambda) = \Phi_{mm}(\lambda)$,
we find


(89)    $\psi(t) = \int_{-\infty}^{\infty} \frac{1}{a(1 + j\lambda)(b - j\lambda)} \exp[j(t - \alpha)\lambda]\,d\lambda = \frac{2\pi}{a(1+b)} \exp[-(t - \alpha)]$ for $t > \alpha$ , and $\frac{2\pi}{a(1+b)} \exp[b(t - \alpha)]$ for $t < \alpha$ .


We shall next evaluate the transform


(90)    $\frac{1}{2\pi} \int_{0}^{\infty} \psi(t) \exp[-j\lambda t]\,dt$


in (86). This transform is


(91)    $\frac{1}{2\pi} \int_{0}^{\infty} \psi(t) \exp[-j\lambda t]\,dt = \frac{1}{a(1+b)} \left[ \int_{0}^{\alpha} \exp[b(t - \alpha)] \exp[-j\lambda t]\,dt + \int_{\alpha}^{\infty} \exp[-(t - \alpha)] \exp[-j\lambda t]\,dt \right] = \frac{1}{a(1+b)(b - j\lambda)(1 + j\lambda)} \left[ (1+b) \exp[-j\alpha\lambda] - (1 + j\lambda) \exp[-b\alpha] \right]$ .


Finally, according to (86), we multiply this expression by $1/\Phi_{ii}^{+}(\lambda)$ to obtain
the optimum system function. The result is that


(92)    $H_{opt}(\lambda) = \frac{1}{a^2(1+b)(b^2 + \lambda^2)} \left[ (1+b) \exp[-j\alpha\lambda] - (1 + j\lambda) \exp[-b\alpha] \right]$ ,


characterizes the optimum lag filter.

The specification of the optimum filter characteristic in the form (92) may
be inconvenient for certain problems, such as synthesis. In such a situation
it may be helpful to specify the filter in the time domain. Applying (41)
to (92) we find that the optimum lag filter has the unit-impulse response


(93)    $h_{opt}(t) = 0$ for $-\infty < t < 0$ ;
        $h_{opt}(t) = \frac{\exp[-b\alpha]}{a^2 b(1+b)} [b \cosh bt + \sinh bt]$ for $0 < t < \alpha$ ;
        $h_{opt}(t) = \frac{\exp[-bt]}{a^2 b(1+b)} [b \cosh b\alpha + \sinh b\alpha]$ for $\alpha < t < \infty$ .
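This response is sketched in Fig. 9.

Fig. 9. Unit-impulse response $h_{opt}(t)$ of the optimum lag filter.

5.2. Pure prediction. — Referring to Sect. 2,
equations (9) and (10), we have for pure prediction


(94)    $\varphi_{id}(\tau) = \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} f_i(t) f_d(t+\tau)\,dt = \lim_{T\to\infty} \frac{1}{2T} \int_{-T}^{T} f_m(t) f_m(t+\tau+\alpha)\,dt = \varphi_{mm}(\tau + \alpha)$ .

The transform of this function is

(95)    $\Phi_{id}(\lambda) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \varphi_{mm}(\tau + \alpha) \exp[-j\lambda\tau]\,d\tau = \exp[j\alpha\lambda]\,\Phi_{mm}(\lambda)$ .

Since $f_i(t) = f_m(t)$ we have

(96)    $\Phi_{ii}(\lambda) = \Phi_{mm}(\lambda) = \Phi_{mm}^{+}(\lambda)\,\Phi_{mm}^{-}(\lambda)$

so that

(97)    $\Phi_{ii}^{+}(\lambda) = \Phi_{mm}^{+}(\lambda)$
and
(98)    $\Phi_{ii}^{-}(\lambda) = \Phi_{mm}^{-}(\lambda)$ .


The ratio $\Phi_{id}(\lambda)/\Phi_{ii}^{-}(\lambda)$ in (80) is therefore


(99)    $\frac{\Phi_{id}(\lambda)}{\Phi_{ii}^{-}(\lambda)} = \frac{\Phi_{mm}(\lambda)}{\Phi_{mm}^{-}(\lambda)} \exp[j\alpha\lambda] = \Phi_{mm}^{+}(\lambda) \exp[j\alpha\lambda]$ .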


With (97) and (99) we now write the formulas for the pure predictor, in
accordance with the general expressions for the optimum linear system (79)
and (80), as follows


(100)    $H_{opt}(\lambda) = \frac{1}{2\pi \Phi_{mm}^{+}(\lambda)} \int_{0}^{\infty} \psi(t) \exp[-j\lambda t]\,dt$ ,

(101)    $\psi(t) = \int_{-\infty}^{\infty} \Phi_{mm}^{+}(\lambda) \exp[j\lambda(t + \alpha)]\,d\lambda$ .


We note that pure prediction depends solely upon $\Phi_{mm}^{+}(\lambda)$.
For a specific example of pure prediction consider a message $f_m(t)$ whose
power density spectrum is


(102)    $\Phi_{mm}(\omega) = \frac{1}{(1 + \omega^2)^2}$ .

For this spectrum we find

(103)    $\Phi_{mm}^{+}(\lambda) = \frac{1}{(1 + j\lambda)^2}$ .


Transforming this expression according to (101) we have


(104)    $\psi(t) = \int_{-\infty}^{\infty} \frac{\exp[j\lambda(t + \alpha)]}{(1 + j\lambda)^2}\,d\lambda = 2\pi (t + \alpha) \exp[-(t + \alpha)]$ for $t > -\alpha$ , and $0$ for $t < -\alpha$ .


The transform in (100) is


(105)    $\frac{1}{2\pi} \int_{0}^{\infty} \psi(t) \exp[-j\lambda t]\,dt = \int_{0}^{\infty} (t + \alpha) \exp[-(t + \alpha)] \exp[-j\lambda t]\,dt = \exp[-\alpha] \left[ \frac{1}{(1 + j\lambda)^2} + \frac{\alpha}{1 + j\lambda} \right]$ .


Therefore the system function of the optimum linear predictor for the
message whose power density spectrum is given by (102) is


(106)    $H_{opt}(\lambda) = (1 + j\lambda)^2 \exp[-\alpha] \left[ \frac{1}{(1 + j\lambda)^2} + \frac{\alpha}{1 + j\lambda} \right] = \exp[-\alpha] [(1 + \alpha) + j\alpha\lambda]$ .

In this example it is easy to see that the system function (106) may be realized
as in Fig. 10.

Fig. 10. Realization of the system function (106).

It is particularly interesting to note that the factor $\exp[-\alpha]$ indicates
that whereas the output amplitude should be comparable with the input
amplitude for very short prediction time, it should decrease exponentially as
the prediction time increases. Accordingly, as the prediction time
tends to infinity, the output should tend to zero. Practically, then, for
very long prediction time the best output of the predictor on the
mean-square criterion is zero output.

The amplification factor in the optimum system function is important,
although practically when the wave form of a desired wave is satisfactory its
amplitude is of secondary importance and sometimes it is of no consequence
at all. The importance of the amplification factor can be seen from the fact
that when the measure of error is the mean-square error two waves of
identical form will not have zero error unless their amplitudes are the same.
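
The predictor (106) can be checked against the Wiener-Hopf equation (27) directly. In the time domain the term $j\alpha\lambda$ corresponds to differentiation, so the left-hand side of (27) becomes $\exp[-\alpha][(1+\alpha)\varphi_{mm}(\tau) + \alpha\,\varphi'_{mm}(\tau)]$, while $\varphi_{id}(\tau) = \varphi_{mm}(\tau+\alpha)$ by (94). The Python sketch below (NumPy assumed) uses the autocorrelation $(\pi/2)(1+|\tau|)\exp[-|\tau|]$, which is the transform of (102) under the convention (29); this closed form, the prediction time and the function names are supplied for the example and do not appear in the text.

```python
import numpy as np

def phi_mm(tau):
    """Autocorrelation for Phi_mm(w) = 1/(1+w^2)^2 under convention (29)."""
    return (np.pi / 2.0) * (1.0 + np.abs(tau)) * np.exp(-np.abs(tau))

def dphi_mm(tau):
    """Derivative of phi_mm with respect to tau (valid for both signs of tau)."""
    return -(np.pi / 2.0) * tau * np.exp(-np.abs(tau))

def wiener_hopf_residual(tau, alpha):
    """q(tau) of (44) for H_opt = exp(-alpha) * [(1 + alpha) + j*alpha*lambda]."""
    filtered = np.exp(-alpha) * ((1.0 + alpha) * phi_mm(tau) + alpha * dphi_mm(tau))
    return filtered - phi_mm(tau + alpha)

alpha = 1.0                                   # illustrative prediction time
tau_nonneg = np.linspace(0.0, 10.0, 101)
max_residual_nonneg = float(np.max(np.abs(wiener_hopf_residual(tau_nonneg, alpha))))
residual_negative = float(wiener_hopf_residual(np.array(-0.5), alpha))
print(max_residual_nonneg, residual_negative)
```

The residual vanishes for $\tau \ge 0$ and is different from zero at the negative shift, in agreement with the discussion of (27) and (44).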




6. — Errors in filtering and prediction. 


To find the expression for minimum mean-square error in the theory of
optimum systems we begin with the expression for mean-square error (18).
In this expression $h(t)$ is not necessarily the optimum system. However, when
$h(t)$ satisfies the condition (27) the expression will then be for minimum
mean-square error. Therefore imposing the condition (27) in (18) we have the
minimum mean-square error,


(107)    $\mathcal{E}_{\min} = \varphi_{dd}(0) - \int_{-\infty}^{\infty} h_{opt}(\tau) \varphi_{id}(\tau)\,d\tau$ .


We shall consider the lag filter for which (86) and (87) have been derived.
For this filter $\varphi_{dd}(0) = \varphi_{mm}(0)$, $\varphi_{id}(\tau) = \varphi_{im}(\tau - \alpha)$ so that


(108)    $\mathcal{E}_{\min} = \varphi_{mm}(0) - \int_{-\infty}^{\infty} h_{opt}(\tau) \varphi_{im}(\tau - \alpha)\,d\tau$ .

To reduce this expression to a form that shows the effect of the lag and
involves simple computation, we shall substitute in (108) the transform (41)
for $h_{opt}(\tau)$, that is, we shall put


(109)    $h_{opt}(\tau) = \frac{1}{2\pi} \int_{-\infty}^{\infty} H_{opt}(\lambda) \exp[j\lambda\tau]\,d\lambda$ .


Then for $H_{opt}(\lambda)$ we substitute the expression (86). In so doing we obtain


(110)    $\mathcal{E}_{\min} = \varphi_{mm}(0) - \int_{-\infty}^{\infty} \varphi_{im}(\tau - \alpha)\,d\tau\,\frac{1}{2\pi} \int_{-\infty}^{\infty} \exp[j\lambda\tau]\,d\lambda\,\frac{1}{2\pi \Phi_{ii}^{+}(\lambda)} \int_{0}^{\infty} \psi(t) \exp[-j\lambda t]\,dt$ .


By inversion of the orders of integration we have


(111)    $\mathcal{E}_{\min} = \varphi_{mm}(0) - \frac{1}{2\pi} \int_{0}^{\infty} \psi(t)\,dt \int_{-\infty}^{\infty} \frac{1}{\Phi_{ii}^{+}(\lambda)} \exp[-j\lambda t]\,d\lambda\,\frac{1}{2\pi} \int_{-\infty}^{\infty} \varphi_{im}(\tau - \alpha) \exp[j\lambda\tau]\,d\tau$ .

With the change of variable $\tau - \alpha = \nu$ we get


(112)    $\mathcal{E}_{\min} = \varphi_{mm}(0) - \frac{1}{2\pi} \int_{0}^{\infty} \psi(t)\,dt \int_{-\infty}^{\infty} \frac{1}{\Phi_{ii}^{+}(\lambda)} \exp[-j\lambda(t - \alpha)]\,d\lambda\,\frac{1}{2\pi} \int_{-\infty}^{\infty} \varphi_{im}(\nu) \exp[j\lambda\nu]\,d\nu$ .


The integral on the extreme right-hand side of this equation is $\overline{\Phi_{im}(\lambda)}$ so
that (112) becomes


(113)    $\mathcal{E}_{\min} = \varphi_{mm}(0) - \frac{1}{2\pi} \int_{0}^{\infty} \psi(t)\,dt \int_{-\infty}^{\infty} \frac{\overline{\Phi_{im}(\lambda)}}{\Phi_{ii}^{+}(\lambda)} \exp[-j\lambda(t - \alpha)]\,d\lambda$ .


By comparison with (87) (for a lag filter) we see that all the quantities in the
integrand of the extreme right-hand integral in (113) are the conjugates of
the corresponding ones in the integrand of (87). Since in (87) $\psi(t)$ is real,
the conjugation of the quantities in the integrand leaves the result of
integration unchanged. In other words


(114)    $\psi(t) = \int_{-\infty}^{\infty} \frac{\overline{\Phi_{im}(\lambda)}}{\Phi_{ii}^{+}(\lambda)} \exp[-j\lambda(t - \alpha)]\,d\lambda$ .

Therefore (113) is reduced to

(115)    $\mathcal{E}_{\min} = \varphi_{mm}(0) - \frac{1}{2\pi} \int_{0}^{\infty} \psi^2(t)\,dt$ .


This is a simple expression which permits the determination of the minimum
mean-square error without going through the determination of the optimum
system. The most important factor in the expression is the ratio


(116)    $\Phi_{im}(\lambda)/\Phi_{ii}^{-}(\lambda)$ .


For convenience we shall let $\psi_0(t)$ be $\psi(t)$ when $\alpha = 0$. In terms of $\psi_0(t)$, (115) is


(117)    $\mathcal{E}_{\min} = \varphi_{mm}(0) - \frac{1}{2\pi} \int_{-\alpha}^{\infty} \psi_0^2(t)\,dt$ .

Since $\psi_0^2(t)$ is never negative we conclude from this expression that the
minimum mean-square error decreases with increasing lag. In other words, the
performance of a lag filter improves with increasing lag. This is a very
interesting and important result. As the lag tends to infinity the minimum
mean-square error tends to an error that cannot be removed by the linear system.
We call this error the irremovable error $\mathcal{E}_{irr}$. Hence the irremovable
mean-square error is


(118)    $\mathcal{E}_{irr} = \varphi_{mm}(0) - \frac{1}{2\pi} \int_{-\infty}^{\infty} \psi_0^2(t)\,dt$ .


It is quite easy to see that if we consider a lead filter then the minimum
mean-square error is given by (117) with $-\alpha$ replaced by $+\alpha$, that is


(119)    $\mathcal{E}_{\min} = \varphi_{mm}(0) - \frac{1}{2\pi} \int_{\alpha}^{\infty} \psi_0^2(t)\,dt$ .
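
The dependence of the minimum error on lag and lead can be illustrated with the example of Sect. 5.1 (message spectrum $1/(1+\omega^2)$, white noise $a^2$). The Python sketch below (NumPy assumed) takes $\psi_0(t)$ from (89) with $\alpha = 0$ and $\varphi_{mm}(0) = \pi$ from (29), and evaluates (117) and (119) by a midpoint sum. As an independent check, which is not part of the text, the large-lag limit is compared with $\pi/b$, the value of $\int \Phi_{mm}\Phi_{nn}/(\Phi_{mm}+\Phi_{nn})\,d\omega$ for this example. The value of `a`, the grid and the names are assumptions.

```python
import numpy as np

a = 0.5                          # illustrative noise level, Phi_nn = a**2
b = np.sqrt(1.0 + a**2) / a
PHI_MM_0 = np.pi                 # phi_mm(0) = integral of 1/(1 + w**2) dw, by (29)

def psi0(t):
    """psi(t) of (89) with alpha = 0: decays as exp(-t) for t > 0, exp(b t) for t < 0."""
    rate = np.where(t >= 0.0, 1.0, b)
    return 2.0 * np.pi / (a * (1.0 + b)) * np.exp(-rate * np.abs(t))

def min_mse(lower_limit, t_max=60.0, dt=1e-3):
    """phi_mm(0) - (1/2pi) * integral of psi0(t)**2 from lower_limit to infinity.

    lower_limit = -alpha gives (117) for a lag filter,
    lower_limit = +alpha gives (119) for a lead filter.
    """
    t = np.arange(lower_limit, t_max, dt) + dt / 2
    return PHI_MM_0 - float(np.sum(psi0(t) ** 2) * dt) / (2.0 * np.pi)

lags = [0.0, 0.25, 0.5, 1.0, 2.0, 4.0]
lag_errors = [min_mse(-alpha) for alpha in lags]
lead_error = min_mse(+1.0)       # prediction time alpha = 1
irremovable = np.pi / b          # independent check of (118) for this example
print(lag_errors, lead_error, irremovable)
```

In this example the error falls quickly with lag and is already close to the irremovable value at a lag of a few time constants, whereas the lead filter does worse than the filter with no lag.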


+ 
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SS 
ow 
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For pure prediction this expression can be further simplified. From (119) we 
find that for long time prediction with +« tending to +co, the minimum 
mean-square error tends to the message power. This means that the output 
of the prediction filter tends to zero as the prediction time tends to infinity. 
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7. A general method for expressing the desired. output. 


In some filter problems the desired output may not be the message but 
rather the derivative of the message. One might want the integral of the 
message. A general method for expressing desired outputs of this type is to 
say that the desired output is the message that has been given a linear ope- 
ration. One example of this is differentiation and another is integration. If 
we let G(w) be the linear operator in the frequency domain and g(t) be its trans- 
form then 


œo 


(120) falt) = | Haale oh ae 


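To introduce this desired output into the general expression (80) we need
$\varphi_{id}(\tau)$, which is


(121)    $$\varphi_{id}(\tau) = \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} f_i(t)\, f_d(t+\tau)\,dt
         = \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} f_i(t)\,dt \int_{-\infty}^{\infty} g(\sigma)\, f_m(t+\tau-\sigma)\,d\sigma$$

         $$= \int_{-\infty}^{\infty} g(\sigma)\,d\sigma \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} f_i(t)\, f_m(t+\tau-\sigma)\,dt
         = \int_{-\infty}^{\infty} g(\sigma)\,\varphi_{im}(\tau-\sigma)\,d\sigma .$$


By transformation of both sides of this equation we have


(122)    $$\Phi_{id}(\lambda) = G(\lambda)\,\Phi_{im}(\lambda) ,$$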
and (80) becomes

(123)    $$\psi(t) = \int_{-\infty}^{\infty} \frac{G(\lambda)\,\Phi_{im}(\lambda)}{\Phi_{ii}^{-}(\lambda)} \exp[j\lambda t]\,d\lambda .$$


For example, if the desired output is the derivative of the message with lead
$\alpha$, then


(124)    $$G(\lambda) = j\lambda \exp[j\alpha\lambda] .$$


In specifying the desired output in this manner we are actually demanding
the linear filter system to differentiate the message in the presence of noise
and to predict the result, all in one step and with the least mean-square error.
A method that can handle a problem of this type is indeed a powerful one.
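
The operator (124) can be tried numerically on a noise-free message. The following is only a minimal sketch, assuming one period of a sampled sinusoid and a plain discrete Fourier transform; the function names and the values of `n`, `dt` and `alpha` are illustrative assumptions and do not come from the text. It applies $G(\lambda) = j\lambda \exp[j\alpha\lambda]$ bin by bin and produces the derivative of the message advanced by the lead $\alpha$.

```python
import cmath
import math

def dft(x):
    """Plain discrete Fourier transform (O(n^2), adequate for a short record)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
            for m in range(n)]

def idft(X):
    """Inverse of dft()."""
    n = len(X)
    return [sum(X[m] * cmath.exp(2j * math.pi * m * k / n) for m in range(n)) / n
            for k in range(n)]

def derivative_with_lead(samples, dt, alpha):
    """Apply G(lambda) = j*lambda*exp(j*alpha*lambda) to one period of a signal."""
    n = len(samples)
    X = dft(samples)
    Y = []
    for m in range(n):
        k = m if m < n / 2 else m - n          # signed frequency index
        lam = 2 * math.pi * k / (n * dt)       # angular frequency in rad/s
        Y.append(1j * lam * cmath.exp(1j * alpha * lam) * X[m])
    return [y.real for y in idft(Y)]

# Hypothetical message: three cycles of a sinusoid in a one-second record.
n, dt, alpha = 64, 1.0 / 64, 0.05
omega = 2 * math.pi * 3
message = [math.sin(omega * k * dt) for k in range(n)]
desired = derivative_with_lead(message, dt, alpha)
expected = [omega * math.cos(omega * (k * dt + alpha)) for k in range(n)]
```

In the filter problem proper the input also contains noise, so the optimum system is obtained from (123) rather than by applying $G(\lambda)$ directly; the sketch only shows what the desired output looks like.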


SUPPLEMENTO AL VOLUME XIII, SERIE X N. 2, 1959
DEL NUOVO CIMENTO 3° Trimestre 


« Learning » Filters, Predictors and Recognizers. 


D. GABOR 


Imperial College - London 


1. — The principle. 


The idea of filters optimizing themselves by « learning », together with the
basic mathematical formalism, was first made public in my report on Communication
and Cybernetics to the International Symposium of Electronics and Television
at Milan, April 12, 1954 (reprinted in IRE Proc., Vol. CT-1, 19 (1954)).
The realization was held up by the difficulty of finding collaborators, and also
by the lack of suitable analogue multipliers, which form an essential part of
the scheme. Recently KUHRT and HARTEL have developed remarkable analogue
multipliers based on the Hall effect in the new semiconducting material
indium arsenide (F. KUHRT: Eigenschaften der Hallgeneratoren, in Siemens
Zeits., 28, 370 (1954); W. HARTEL: Anwendung der Hallgeneratoren, in Siemens
Zeits., 28, 376 (1954)). The time now appears ripe for attacking the problem
in earnest.

The principle described in my paper l.c. may be briefly recapitulated. It
is based on the fact that in a limited waveband of width F the information
can be considered as arriving in the form of discrete data, one every 1/2F seconds.
From this follows the important consequence that all operators in a finite
waveband can be represented in algebraic form, operating on 2F data per second.
It is convenient to take as data the samples of the signal amplitude $s_n$ at
discrete sampling points 0, 1, ..., n, spaced by 1/2F, where 0 corresponds to the
present instant and n is positive in the past. (Amplitude samples are
convenient for easy illustration, but one can as well take as data the expansion
coefficients of the signal in terms of any set of suitable functions, such as
described in GABOR, l. c.)

Let O be the operator of which the filter is a physical realization. It
operates on the past samples of the incoming signal s(t), i.e. on
$s_0, s_1, \ldots, s_n, \ldots$ It is convenient (and of course practically
unavoidable) to assume that beyond a certain time N/2F the past does not matter.
We then write the most general transformation which a filter can effect in the form


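(1)    $$O(s) = \sum_{0}^{N} s_n r_n + \sum_{0}^{N}\sum_{0}^{N} s_{n_1} s_{n_2}\, r_{n_1 n_2} + \sum_{0}^{N}\sum_{0}^{N}\sum_{0}^{N} s_{n_1} s_{n_2} s_{n_3}\, r_{n_1 n_2 n_3} + \ldots$$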


The $r_n, \ldots$ are the response coefficients of 1st, 2nd, ... order. They will
be assumed as constants. It may be noted though that in the « most intelligent »
filter they would have to be made dependent on past information. Such a filter
would notice for instance, from information received, possibly further back
than N, that the language to be filtered is English or French, and would adjust
its response coefficients accordingly. We leave such universal filters out of
consideration in what follows, and consider only filters adapted to one
stationary time series. In this case the $r_n, \ldots$ are constants, and
translation invariant, as they must be. Optimization is then a matter of
adjusting the set of the $r_n, \ldots$
The rule for optimization depends on the problem, and on the « success
criterion » which one wishes to adopt. There are 3 types of problems.


1) Filtering, i.e. removing the unwanted component (noise), from the 
signal received. 


2) Prediction, i.e. producing from the past values $s_0 \ldots s_N$ a value with
a negative index, say $s_{-d}$, which is most nearly right in most cases.


3) Recognition, i.e. producing a distinctive output for a certain type, or
types, of signals. In the simplest case of one type of signal only, this is the
same as filtering; all unwanted signals are rejected as noise. The problem
becomes distinct if there are several types of signal to be distinguished. The
discussion of this problem will be left for later.


There are also various success criteria which one could adopt. One could
for instance try to maximise the rate of information, as defined by SHANNON.
This has the disadvantage that one cannot even formulate it as a rule for
determining the filter parameters before one has solved the recognition problem,
because Shannon's definition is in terms of communication signs and their
probabilities. Moreover, even in the simplest cases, in which one can give a
simple mathematical description of the communication signs, the optimization
becomes hopelessly complicated. We can therefore consider only criteria which
are simple functions of the true signal amplitude s(0) or, in the case of
prediction, of s(d), and of the value O(s) supplied by the filter. One could think
e.g. of the integrated absolute deviation of the value O(s) from the true value.
But all these criteria, with the exception of one, are not only clumsy to handle
mathematically, but one can never be sure that there will be a unique optimum.
The one exception, the one which we adopt, is the minimum mean square
error criterion, as used by WIENER and by KOLMOGOROFF in their classical
work on optimum linear filters. This is easy to handle, and is certain to give
a unique solution.


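Consider for instance the prediction problem. For simplicity we write $s_d$
instead of $s_{-d}$ for the value to be produced by the filter, so that d is
positive if it represents a delay, negative for the case of prediction. We have
then for the mean square error


(2)    $$\overline{(O(s) - s_d)^2} = \sum_{0}^{N}\sum_{0}^{N} \overline{s_{n_1} s_{n_2}}\; r_{n_1} r_{n_2}
       + 2 \sum_{0}^{N}\sum_{0}^{N}\sum_{0}^{N} \overline{s_{n_1} s_{n_2} s_{n_3}}\; r_{n_1} r_{n_2 n_3}$$

       $$+ \sum_{0}^{N}\sum_{0}^{N}\sum_{0}^{N}\sum_{0}^{N} \overline{s_{n_1} s_{n_2} s_{n_3} s_{n_4}}\; r_{n_1 n_2} r_{n_3 n_4} + \ldots
       - 2 \sum_{0}^{N} \overline{s_d s_n}\; r_n - 2 \sum_{0}^{N}\sum_{0}^{N} \overline{s_d s_{n_1} s_{n_2}}\; r_{n_1 n_2} - \ldots + \overline{s_d^2} .$$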


This is a quadratic form, moreover a positive definite quadratic form of the
variables $r_n, \ldots$; hence it must have a minimum, and this is unique. By this
we are assured that if we use the least mean square criterion in a « learning »,
i.e. self-optimizing, device, it will not hook itself on to a subsidiary minimum.

The coefficients of the binary products of the $r_n, \ldots$ are all of the form


$$\overline{s_{n_1} s_{n_2} \cdots s_{n_k}} .$$


These are the autocorrelation coefficients of the stationary series $s_n$ of
different order. We take the order as one less than the number of factors, so
that the ordinary autocorrelation function


$$\overline{s_{n_1} s_{n_2}} = \varphi_1(n_1 - n_2)$$


will now be called « of the first order ». The autocorrelation functions are all
translation invariant, i.e., they are functions of the differences in the $n_i$
only. Hence the k-th order autocorrelation function is


(3)    $$\overline{s_{n_1} s_{n_2} \cdots s_{n_{k+1}}} = \varphi_k(n_1 - n_2,\; n_2 - n_3,\; \ldots,\; n_k - n_{k+1}) .$$


In the filtering problem the cross-correlation functions between the signal
and the noise will also appear if the noise is additive (see GABOR, l. c.). In
either case, carrying out the minimization of the square error one is led to a
set of linear equations, with the correlation coefficients (i.e., values of the
correlation functions) as coefficients, and the $r_n, \ldots$ as the unknowns.
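
The passage from the mean square error to a set of linear equations can be illustrated numerically. The sketch below is an assumption-laden toy, not the procedure of the paper: it truncates the expansion at the second order, merges the symmetric products $s_{n_1} s_{n_2}$ and $s_{n_2} s_{n_1}$ into one term with $n_1 \le n_2$, estimates the averaged products from a finite record, and solves the resulting linear equations by elimination. The names `features`, `solve` and `fit_filter`, the record length and the test signal are all hypothetical.

```python
import random

def features(window):
    """Linear terms s_n followed by second-order products s_n1*s_n2 (n1 <= n2)."""
    n = len(window)
    quad = [window[i] * window[j] for i in range(n) for j in range(i, n)]
    return list(window) + quad

def solve(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x

def fit_filter(signal, target, memory):
    """Coefficients minimizing the mean square error, from averaged products."""
    rows, wanted = [], []
    for t in range(memory, len(signal)):
        window = [signal[t - n] for n in range(memory + 1)]  # s_0 is the present
        rows.append(features(window))
        wanted.append(target[t])
    k, count = len(rows[0]), len(rows)
    # Averaged products of the data: the correlation coefficients of the text.
    a = [[sum(r[i] * r[j] for r in rows) / count for j in range(k)] for i in range(k)]
    b = [sum(r[i] * w for r, w in zip(rows, wanted)) / count for i in range(k)]
    return solve(a, b)

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(400)]
# Hypothetical "fair copy": a mildly non-linear function of the last two samples.
y = [0.0] + [0.5 * x[t] - 0.3 * x[t - 1] + 0.2 * x[t] * x[t - 1]
             for t in range(1, len(x))]
coeffs = fit_filter(x, y, memory=1)   # order: s0, s1, s0*s0, s0*s1, s1*s1
```

With a memory of only two samples there are five unknowns; the count grows quickly with the memory and the order, which is the difficulty taken up in the next paragraph.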

Solving these equations (say a hundred) is in itself a formidable proposition,
but collecting the coefficients, whose number is of the order of the square
of the number of equations, is practically hopeless. I therefore proposed
in 1954 a short cut through these difficulties, by suggesting a machine
which adjusts itself optimally by a learning or training process. This is much
the same as what the higher animals are doing. The brain is, among other 
things, an astoundingly universal filter. The lower animals are born with 
« built-in » reflexes, ready for all everyday emergencies, but the immensely
more complicated human brain contains at the start little more than
potentialities. It is, as it were, an enormously complex network, in which almost
all the switches are still open. This may well be due to the simple fact that 
the gametes are unable to carry the information for the detailed connections, 
but however this may be, the fact that we must learn almost the whole anal- 
ysis of the sensory data and the reactions to almost all but the most elementary 
situations pays a rich dividend in adaptability. It appears that our machines 
have now also reached the degree of complication at which it does not pay 
any longer to plan all their reactions in advance. Instead we must make them 
potentially capable to cope with every situation, but leave it to them to 
acquire their reactions by « learning ». 


[Fig. 1. Block diagram of the training machine. Labelled blocks: delay; output;
comparator (success criterion); integrate square error; adjustable filter; noise;
minimum computer.]


Fig. 1 illustrates the « training machine » and is taken from the original paper.
The filter has a great number of knobs, each representing one of the
coefficients $r_n, \ldots$, adjustable e.g. by means of a potentiometer. There are
two records provided, one of the pure signal, the other of the signal mixed with
noise. (In the case of prediction training one record will do, with two pickups.)
The mixed signal is fed into the filter and the output compared with the pure 
signal, (or « fair copy ») by a «comparator » which calculates the square of the 
difference, and integrates it over the whole playing time. The adjustment 
mechanism comes into action only when the whole record is played through. 

Various strategies can be adopted for the adjustment of the great number
of knobs, say 100. The simplest, though not necessarily the most efficient,
strategy is Southwell's relaxation routine: only one knob is turned at a time.
Starting from some first guess, one knob is given successively three positions,
say positive maximum, zero, negative maximum. This knob is reset to the
original « first guess » position, and the same play repeated in turn with all
the other knobs. The minimum computer now calculates the « influence curves »
of every knob. In our case these will be simple parabolas, which are deter- 
mined by 3 points, hence as the original setting is one point, two other settings 
are sufficient. The minimum computer gives the best setting for every knob 
(if this alone is turned of course, the others remaining in the first guess posi- 
tion,) and indicates the one whose optimum setting gives the largest reduction 
in the integrated square error. In a fully automatic machine it will also turn 
the knob. The play is now repeated, until no worth-while further reduction 
can be obtained. 
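
The routine just described can be sketched in a few lines. This is only an illustration under stated assumptions: the two extra settings of each knob are taken symmetrically at a distance `span` on either side of its current position (rather than at fixed maximum positions), the integrated square error is replaced by a small hypothetical quadratic function `error`, and the names `relax`, `span` and `max_adjustments` are invented for the example.

```python
def relax(error, knobs, span=1.0, max_adjustments=200, tol=1e-18):
    """One-knob-at-a-time relaxation on a positive definite quadratic error."""
    knobs = list(knobs)
    for _ in range(max_adjustments):
        e0 = error(knobs)                      # run with the current settings
        best = (0.0, None, None)               # (reduction, knob index, new setting)
        for i in range(len(knobs)):
            trial = list(knobs)
            trial[i] = knobs[i] + span
            e_plus = error(trial)              # first extra run for this knob
            trial[i] = knobs[i] - span
            e_minus = error(trial)             # second extra run for this knob
            # Parabola through the three points: its vertex and its depth.
            curvature = e_plus - 2 * e0 + e_minus
            setting = knobs[i] - span * (e_plus - e_minus) / (2 * curvature)
            reduction = (e_plus - e_minus) ** 2 / (8 * curvature)
            if reduction > best[0]:
                best = (reduction, i, setting)
        if best[1] is None or best[0] < tol:
            break                              # no worth-while reduction left
        knobs[best[1]] = best[2]               # turn only the most effective knob
    return knobs

def error(r):
    """Hypothetical positive definite quadratic form of two knob settings."""
    return (r[0] - 1) ** 2 + (r[1] + 2) ** 2 + (r[0] + r[1]) ** 2

settings = relax(error, [0.0, 0.0])
```

For this test form the minimum lies at $r_0 = 4/3$, $r_1 = -5/3$, and each adjustment costs two runs per knob, which is the count of 200 runs for 100 knobs used in the following paragraph.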

This is a safe but slow routine; it requires 200 runs for one adjustment if
there are 100 knobs. There is no doubt that it can be greatly speeded up.
First, one will certainly notice after the first 200 runs that only a small
fraction of the coefficients is effective. One can then discard all those which
give, say, less than 5% of the largest reduction, and return to them only
when all the more important adjustments are made. Second, after the first
200 runs one can assume as a working hypothesis that the order of importance
of the others has not changed by setting the most important coefficient. One
therefore goes through these in their original order of importance, setting
them, after 2 runs in each case, to their relative optima. This procedure is
also perfectly safe, because in the case of a positive definite quadratic form
one is certain to proceed towards the absolute minimum so long as every step
produces a decrease. Thus, in this method, it takes at most 400 runs before all
100 knobs are reset. One then starts the routine again.

It will be of interest to investigate whether the alternative strategy of 
« steepest descent », proposed by G. TEMPLE in 1938, may not prove superior 
to Southwell’s relaxation routine. In this method, after taking a small 
section of the influence curves which can be considered as straight, a 
computation is made of the linear combination of alterations which produce 
the largest decrease. 
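
For a quadratic error the steepest-descent step can be written down explicitly. The following is a sketch in a notation that is assumed here and not used in the paper: the coefficients are collected in a vector $r$, the error is written with a symmetric positive definite matrix $A$, a vector $b$ and a constant $c$, and $\mu$ is the step length.

```latex
\begin{align*}
E(r) &= r^{T} A\, r - 2\, b^{T} r + c, \\
g &= \nabla E(r) = 2\,(A r - b), \\
E(r - \mu g) &= E(r) - \mu\, g^{T} g + \mu^{2}\, g^{T} A\, g, \\
\mu_{\text{opt}} &= \frac{g^{T} g}{2\, g^{T} A\, g}, \qquad
E(r) - E(r - \mu_{\text{opt}}\, g) = \frac{(g^{T} g)^{2}}{4\, g^{T} A\, g}.
\end{align*}
```

The components of $g$ are the initial slopes of the influence curves, so one small displacement of each knob suffices to measure the direction; the quantity $g^{T} A\, g$ would need one further run along that direction.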

I propose to take the 2 (or for safety perhaps 3) points which determine 
an influence curve not in 2 (or 3) successive runs, but in one, by doubling 
(or trebling) the comparator. This means 100 runs for the first re-setting of 
100 knobs, and I surmise that it will be seldom necessary to repeat this more 
than 10 times, making a maximum of 1000 runs in all. As will be shown 
later, the machine can be made fast enough to process at least 100 data per 
second, 10000 data in a record of 1 min 40s duration. 1000 runs can be 
made therefore in somewhat less than 24 hours (or a little more, taking into
account the re-winding time). This is certainly not prohibitive if the machine 
is fully automatic. 


2. — Fields of application. 


The learning filter can be considered either as a scientific instrument for 
solving mathematical problems, or as a prototype model of practical filters, 
which are functionally (not necessarily structurally) copies of it. These copies
do not learn; the prototype has done all the learning for them.

Applications of the second type are evidently the more important ones 
from a commercial point of view, but it will be better to suspend opinion for 
a while on the commercial value of this whole work. It is quite conceivable 
for instance that one will find that non-linear filters are not sufficiently efficient 
in telephony for justifying the probably very considerable costs. On the other 
hand, there can be no doubt about the scientific interest of the venture. 
The learning non-linear filter can be expected to make a long dash into a 
field of which only the outer fringes have been explored by mathematicians. 
It may well be that the result of the exploration will be that nothing much
less complicated than the human brain can be of appreciable value for
assisting the human brain and senses. But even if the report should be
« nothing but desert land for the next 100 miles », at least it will save
expeditions equipped for a score of miles only.

I will classify the possible fields of application into: Communications, Con- 
trol problems and Statistical forecasts. 


3. — Communications. 


1) Telegraphy. The recognition of Morse signals in a noisy background
is a promising field for non-linear filters, because the constant level of the
signals is a recognition index which is missed by linear devices.


2) Radar. It will be most interesting to investigate whether a system 
exists better than « correlation reception ». In this system the received noisy 
signal is multiplied by the emitted pulse, with different delays, and there is 
a peak when this delay is equal to the return time of the pulse. 


The learning filter is very well equipped for dealing with this problem,
because correlation reception can be achieved by using its linear terms only.
Let the radar pulse correspond to the sequence $x_1 \ldots x_m \ldots x_M$. We
then make

$$r_n = x_n ,$$

and the filtered signal is

$$\sum_{1}^{N} s_n x_n ,$$

which represents correlation reception if N = M (this is several thousands in
radar, but a smaller number will do in model experiments). Thus the learning
filter can realise correlation reception by using its linear terms only, and it
would be surprising if its higher order terms could not improve on it.
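
Correlation reception with linear terms only can be sketched as follows. The example is hypothetical throughout: the pulse is a 13-element Barker code, the noise level and the true delay are arbitrary, and the samples are indexed forward in time rather than backwards from the present, which changes only the labelling of the coefficients.

```python
import random

def correlate(received, pulse):
    """Linear filter output for every delay d: sum over k of received[d+k]*pulse[k]."""
    m = len(pulse)
    return [sum(received[d + k] * pulse[k] for k in range(m))
            for d in range(len(received) - m + 1)]

random.seed(1)
pulse = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]   # Barker-13, an assumed choice
true_delay = 40
received = [random.gauss(0.0, 0.2) for _ in range(100)]  # noisy background
for k, value in enumerate(pulse):
    received[true_delay + k] += value                     # returned pulse

output = correlate(received, pulse)
# The peak of the filtered signal marks the return time of the pulse.
estimated_delay = max(range(len(output)), key=lambda d: output[d])
```

The coefficients of the filter are simply the pulse samples, so no training is needed to reach this performance; training would only be of interest for the higher-order terms.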


3) Telephony. Here we must distinguish several problems of increasing 
complexity. 


3a) Filters for eliminating noise from speech. Human speech has a 
certain recognizable character for the ear; one can recognize it as having issued 
from a human mouth long before one can understand it. (Certain gurgling 
and clicking African languages excepted!) The best that linear filters can do 
is to use filters of the average spectral characteristic of the language. (Very 
carefully studied by the G.P.O. who have taken the spectra of almost all 
European languages.) But this is certainly not all one can do. The difficulty 
is only that these characteristics are almost certainly in the syllabic
modulation, in the slow rhythm of amplitude and frequencies, not in the wave-shapes
or even in the short-time spectra. I think that little would be gained by trying
out the filter simply on speech records, mixed with noise. 

It is very likely therefore that in order to be successful, the filter (or a 
device associated with it), will first have to take the signal to pieces, i.e., ana- 
lyse speech into the components which have been recognized as essential in 
the research of the last 30 years, apply the processing to these, and then put 
the pieces together again. 

It is easy to see though, that on the basis of present-day knowledge this
would require a very complicated apparatus. The best waveband-saving
apparatus at this moment is the Bell 32-channel Vocoder, which separates the
spectrum of about 3600 Hz into 32 channels, in each of which a frequency
band of 25 Hz is sufficient for transmission. This gives fully acceptable speech,
of commercial quality. There are now claims that the 16 channel Vocoder
has also reached this level. But even if we accept this claim, and put down
the channel frequencies to 20 Hz, this still means 2 × 20 × 16 = 640 data per
second. One requires at least 1/8 of a second to recognize the syllabic
regularities which enable the human ear to separate speech from noise, and this
still means that the filter has to take in 80 data simultaneously. This is well
beyond what one can contemplate for a start.

On the other hand one could well think of eliminating noise with simple 
characteristics, such as Morse signals and clicks from speech. This must be 
left to experiments. 


3b) Speech telegraphy. If we drop the criterion of « commercial quality »
the problem becomes more manageable, though still difficult. WALTER
LAWRENCE has shown convincingly that perfectly intelligible and
human-sounding speech can be synthesized out of 6 slowly changing parameters
(the frequency and intensity of the larynx tone, the intensity of the « hiss »,
and the position of the first three formants). Each can be transmitted in a
20 Hz channel, hence the number of data per second is 2 × 20 × 6 = 240. This
still gives 30 data every 1/8 s, but for speech recognition it may be sufficient
to take 3 consecutive sets of 6 data, 18 in all, which is manageable. These extend
over a time of 3 × 25 = 75 ms; perhaps one can stretch it to make it 0.1 s.
This is probably sufficient for recognizing most if not all morphemes, such as
vowels, which are recognizable in themselves, certain consonants like « sh »
which are also recognizable, and combinations such as « ka » and « at ».

I have mentioned before that in the simplest case, if only a single type of 
signal must be recognized, the process is not essentially different from filtering. 
If for instance we want to recognize «ka», the training record contains the 
speech record (in some form,) while the « fair copy » or pure signal record has 
everything wiped out except the « ka »-s which occur in the record. All other 
speech sounds are considered as noise, and the machine will do its best to 
eliminate them. 

It is easy to see that no linear filter could achieve this for speech
recognition. In the linear case the problem is this: Given a wanted signal with
the components $x_1 \ldots x_N$ and unwanted signals $y_1^i \ldots y_N^i$, find
such a set of coefficients $r_n$ that


$$\sum_{1}^{N} x_n r_n = 1 ,$$

while for all i

$$\sum_{1}^{N} y_n^i r_n = 0 .$$


This means orthogonalizing the unwanted signals to the wanted signal. It
gives a set of linear equations, one more than the number of unwanted signals,
and one can solve these if the number of unwanted signals is less than N. But
in the case of speech, analysed in Lawrence's way, in 3 consecutive intervals we
have only 18 parameters, sufficient to eliminate 17 unwanted signals only. (One
can visualize this by imagining each sound-section as a vector in an
18-dimensional space. There are only 18 orthogonal vectors in such a space.)
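
The counting argument can be checked with a small numerical sketch. Everything in it is assumed for illustration: the wanted and unwanted sound-sections are random 18-component vectors, and `solve` and `dot` are helper names introduced here. With 17 unwanted signals the 18 equations determine the coefficients completely, so that a further unwanted signal is in general no longer rejected.

```python
import random

def solve(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

random.seed(2)
N = 18
wanted = [random.gauss(0.0, 1.0) for _ in range(N)]
unwanted = [[random.gauss(0.0, 1.0) for _ in range(N)] for _ in range(N - 1)]

# One equation for the wanted signal (response 1), one per unwanted signal (response 0).
matrix = [wanted] + unwanted
rhs = [1.0] + [0.0] * (N - 1)
r = solve(matrix, rhs)

# An 18th unwanted signal finds no free coefficient left to cancel it.
extra = [random.gauss(0.0, 1.0) for _ in range(N)]
leak = dot(extra, r)
```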

A non-linear filter however can deal with this situation, if it has enough
terms, as it is not restricted to linear combinations. In general higher algebraic
forms have the drawback that if the terms balance at one level, they will not
balance at all levels. One can avoid this by choosing for recognition only terms
of one order, in which any one signal component occurs always with the same
power only. For instance one could take bilinear terms only, specifying that


$$\sum_{n}\sum_{m} x_n x_m\, r_{nm} = 1 , \qquad \sum_{n}\sum_{m} y_n^i y_m^i\, r_{nm} = 0 \quad \text{for all } i .$$


This gives ½N(N+1) available parameters, a much greater number. For
N = 18 this is 171, which again is rather too large, but can probably be
reduced. Hence it is not hopeless to train a non-linear filter for speech
recognition, by the method of filtering out one morpheme at a time. This,
however, would give one (rather complicated) filter for each morpheme, a
rather impractical proposition.
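
The parameter count and the reason for keeping to terms of one order can be made concrete with a short sketch. The sound-section used is an arbitrary assumed vector, and `bilinear_terms` is a name introduced here; each product $s_n s_m$ with $n \le m$ would carry its own coefficient $r_{nm}$.

```python
from itertools import combinations_with_replacement

def bilinear_terms(signal):
    """All products s_n*s_m with n <= m, one per available coefficient r_nm."""
    return [signal[n] * signal[m]
            for n, m in combinations_with_replacement(range(len(signal)), 2)]

N = 18
section = [0.1 * (n + 1) for n in range(N)]       # hypothetical sound-section
terms = bilinear_terms(section)

# Doubling the signal level multiplies every term by the same factor 4, so a
# balance between terms reached at one level is kept at all levels.
louder = bilinear_terms([2.0 * s for s in section])
```

Here `len(terms)` equals N(N+1)/2 = 171 for N = 18, the number quoted in the text.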

An improvement can be obtained as follows. 

Assume that we have 64 = 2^6 morphemes to distinguish. We divide up
the filter into 6 parts, with 6 outputs, and the training « fair copy » tape has
also 6 tracks. These have only three grades of intensity: positive, negative
and zero, and contain the morphemes in binary code, zero being left for
uncertainty, or unwanted noise. Thus the 6 filters are trained to give one
digit each of the code; they give sufficiently strong positive or negative signals
if a recognizable morpheme comes along, and zero for uncertainty, noise, or
gaps in the speech.
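
The six-track code can be sketched as follows. The assignment of morphemes to numbers 0 to 63, the levels +1 and -1, and the decision threshold of 0.5 are assumptions made for the illustration only.

```python
def encode(index, digits=6):
    """Fair-copy levels for one morpheme: +1 for a binary 1, -1 for a binary 0."""
    return [1 if (index >> k) & 1 else -1 for k in range(digits)]

def decode(outputs, threshold=0.5):
    """Morpheme number from the 6 filter outputs, or None if any digit is too weak."""
    index = 0
    for k, value in enumerate(outputs):
        if abs(value) < threshold:
            return None        # zero level: uncertainty, noise or a gap in the speech
        if value > 0:
            index |= 1 << k
    return index

tracks = encode(37)                                  # [1, -1, 1, -1, -1, 1]
clear = decode([0.9, -0.8, 0.7, -1.1, -0.6, 0.95])   # strong digits: 37
doubtful = decode([0.9, -0.8, 0.1, -1.1, -0.6, 0.95])  # one weak digit: None
```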

It is by no means clear whether training must lead to success, even with 
a very complicated filter, if the binary code is arbitrarily chosen, because this 
presupposes that each morpheme has 6 independent binary characteristics. 
(JACOBSON has shown, rather convincingly, that there are 7, but some may 
be redundant.) This is a matter for experiment. 

I have dealt with speech recognition at some length, but I do not want
to conceal my personal opinion that of all robots the « automatic typist » is
the most unnecessary, next to the mechanical translator. It is however such
a challengingly difficult problem that it is not easy to avoid getting intoxicated
with it.


4. — Control problems. 


I will not attempt a systematic discussion, but will pick out three impor- 
tant problems. In the first two of these we want to achieve prediction. (It is 
easy to modify these into noise-filtering problems.) 


1) Straight follower with prediction. We observe the instantaneous value
s(0) of a quantity, e.g. the position of a target. This is taken as the input of
the filter, while the output goes into a « system » which contains all the controls
and the whole chain of processes which lead to the output quantity. (For
instance if the output quantity is the position of a projectile at the later time d
when it reaches the range of the target, the « system » will include the
gun-laying system and the ballistic process.)
Let O again be the operator of the filter and S the system operator; we
prescribe

$$S\,O(s(0)) = s(d) .$$


Let $S^{-1}$ be the inverse system operator, so that

$$S S^{-1} = S^{-1} S = 1 .$$


We obtain for the filter operator the rule

$$O\,s(0) = S^{-1} s(d) .$$

In this form the problem is ready for the training process, if we know the
operator $S^{-1}$. (For instance the input at time 0 of the system necessary to
achieve a certain result at the later time d.) The difficulty might arise that
we know S, but we do not know $S^{-1}$. In this case the filter itself can be used
to solve the problem. We must only give it the instruction


$$S\,O(s) = s$$

and the solution is

$$O = S^{-1} .$$


But if we have an analogue of S, we can also put it in series with the filter
and solve directly the first equation


$$S(O\,s(0)) = s(d) .$$


2) Follower with feedback.
We now have the equation


$$S(s(0) - O\,s(d)) = s(d)$$

or


$$s(0) = (O + S^{-1})\,s(d) .$$


$O + S^{-1}$ is the inverse of the prediction operator; it produces s(0) from the
later value s(d). There is no need to solve it. If we possess an analogue of
the system we realize the connections as shown in the above sketch and have
again a straightforward prediction-training problem.

Both problems can be at once converted into filtering problems without
prediction if we replace s(0) by signal + noise, and s(d) by the « pure signal »,
which in this case means the true value of the quantity of interest, which we
want to achieve, irrespective of the misleading observation.


3) Analogues of systems which are only experimentally known. This can 
be called « Tustin’s Problem » as its great importance has been pointed out 
by A. Tustin: The Mechanism of Economic Systems (Appendix to 2nd ed. 
(London, 1955)), and who has also inspired the only two investigations on this 
subject which are known to me: 


J. B. RESWICK: Determining System Characteristics from Normal Operating 
Records, in Control Engineering (June 1955). 


T. P. GOODMAN: Experimental Determination of System Characteristics from
Correlation Measurements, M.I.T. Thesis (June 1955). (This also contains the
full literature of the subject.)


As far as I know, these previous attempts relate to linear systems only, 
and the imitation of system characteristics is not automatic. The problem 
is a straightforward one for the learning filter. The system input, as given 
by operating records, is used as input, and the system output as the « fair
copy ». The filter then converts itself into an analogue of the system.

Applications to economic systems are particularly interesting. It is only 
to be hoped that sufficiently extensive records can be obtained. If one wants 
to test the prediction value of a filter which can exactly reproduce, say, 
100 data, one ought to have test records of at least 1000—10000 data. 


5. — Statistical forecasts. 


This is mostly covered by the last section. I want only to mention the 
particular interest in weather forecasts. 

The general problem of weather forecasting is roughly as follows. One 
observes n quantities of interest (such as temperature, air pressure, humidity, 
wind strength and direction, etc.) in N observation stations. In addition one 
has parameters such as the time of the year, sunspot activity, etc. From these 
one wants to forecast n quantities in at least one place. This n-vector is
therefore a stochastic function of an nN-vector, and nN is a very large
number. I propose to give our machine 18 inputs, so that we could cope at
most with n= N=4 or n=3, N=6. It is hardly possible to say in advance 
whether this gives the machine a reasonable chance, even if we train it on 
time series in which the parameters (time of the year, sunspots, etc.) are 
chosen as carefully equal to the present situation as possible. There would 
be little point in the machine embodying the laws of air dynamics at this 
stage, as the data are hopelessly inadequate for dynamical determination.
I think though that there is a chance that we might get a good idea of « how 
complicated a machine would be adequate to deal with the problem? ». 

A problem more suitable for a relatively small machine is the one of the 
degree of correlation between two statistical series. NORBERT WIENER has in 
the last years worked out a most important measure for the linear correlation 
of two series. Our machine is in principle capable of tackling this problem, 
by using only its linear part, but making the r, terms numerous enough to 
span the longest delays between an event in the first series and in the second 
which may be of any importance. In principle it can, however, measure
correlations to higher degrees. The measure of success is always the reduction of
the mean square error in the reproduction of a series B by knowledge of the
series A, i.e. taking this as the input. One can then take different series
A', A'', ... and thus quantitatively allocate the causation of B to A', A'', etc.
One can also test whether for instance A was more the cause of B or vice
versa. These are the types of problems which NORBERT WIENER has
promised to investigate in his as yet unwritten book on The Grammar of the
Semi-Quantitative Sciences. I think that there is a good chance of tackling
them experimentally.
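
The proposed measure can be sketched with the linear part only. The example below is a toy under stated assumptions: series B is constructed as series A delayed by two samples plus a little independent noise, the filter sees only the three preceding samples of its input, and the coefficients are found by solving the least-squares equations directly instead of by training. The names `solve` and `mse_reduction` are introduced here.

```python
import random

def solve(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x

def mse_reduction(cause, effect, lags):
    """Fraction of the mean square of `effect` removed by a linear filter
    operating on the `lags` preceding samples of `cause`."""
    rows = [[cause[t - n] for n in range(1, lags + 1)]
            for t in range(lags, len(effect))]
    wanted = effect[lags:]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(lags)] for i in range(lags)]
    b = [sum(r[i] * w for r, w in zip(rows, wanted)) for i in range(lags)]
    coeffs = solve(a, b)
    residual = sum((w - sum(c * v for c, v in zip(coeffs, r))) ** 2
                   for r, w in zip(rows, wanted))
    return 1.0 - residual / sum(w * w for w in wanted)

random.seed(3)
a_series = [random.gauss(0.0, 1.0) for _ in range(500)]
# Hypothetical series B: A delayed by two samples, plus independent noise.
b_series = [(0.8 * a_series[t - 2] if t >= 2 else 0.0) + random.gauss(0.0, 0.1)
            for t in range(500)]
gain_a_to_b = mse_reduction(a_series, b_series, lags=3)   # close to 1
gain_b_to_a = mse_reduction(b_series, a_series, lags=3)   # close to 0
```

The asymmetry between the two reductions is what would be read as A being « more the cause » of B than the reverse; with real records the reduction should of course be measured on data not used for the adjustment.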


Television Compression by « Contour Interpolation ».


D. GABOR 


Imperial College - London 


1. — The principle of contour interpolation. 


The enormous waveband of 3 to 6 MHz which is required for acceptable
television pictures has been often contrasted with the limited intake of visual 
perception, estimated by YVES LE GRAND and others to be of the order of 
a score of bits per second. At present we can hardly dream of realizing this 
enormous potential compression ratio. It would probably require an analyser 
with a complexity comparable to that of the human nervous system, breaking 
down the raw data into familiar patterns (« Gestalten »), and filling in missing 
details from past experience. Contour interpolation is a modest first step in 
this direction. It takes in two data at a time (to be compared with one datum 
in ordinary television and probably a few thousand in the case of the eye (*)), 
and recognizes only one type of pattern; an outline, or more exactly a short 
bit of an outline which can be considered as straight without serious error. 

The progress from one sensing spot to two might not be very important, 
were it not that contour interpolation steps in at the weakest point of present 
day television systems, in which they are most wasteful. First, one of the 
two interlaced fields in TV pictures adds very little to the information; it is 
indispensable only for suppressing the flicker. Second the frame frequency 
(25 frames/s in Europe, 30 in the U.S.) is higher than justified by the time
resolution of the eye; it is also justified only by the flicker. Films with
16 frames/s (with a shutter frequency of 48 s⁻¹) are quite satisfactory if the
ratio of bright
to dark phases is sufficiently large, and in systems with « optical equalization » 
in which one picture fades continuously into the next, the frame frequency can 


(*) Cfr. the remarks of W. A. H. RUSHTON at the First London Symposium on
Information Theory (1950), on hundreds of rods connected to a single nerve fibre, a 
fact which suggests that « Gestalt » formation starts in the eye itself. 

be reduced to 6 s⁻¹ or less. But, as will be shown, contour interpolation
represents considerable progress for the display of motions over the
best optical equalization systems. Hence it may not be too sanguine to
expect from it a saving ratio of 2 × 4 = 8; 2 from the saving of every
second field, 4 from the saving of 3 frames in 4.

Optical equalization is a case of ordinary or linear interpolation, which is
illustrated in Fig. 1a). The two lines can be equally well considered as
next-but-one lines in one frame, or as corresponding lines in next-but-one frames.
If the signal amplitudes are $s_1(t)$ and $s_2(t)$ in lines 1 and 2, the interpolated
signal is

$$\tfrac{1}{2}s_1(t) + \tfrac{1}{2}s_2(t).$$

It is evident that this preserves clear outlines only if they are vertical and
stationary. Slanting contours and moving edges will be blurred. (Nevertheless
I think it highly likely that linear interpolation could be used for reducing
the frame frequency to one-half of the usual value.)

[Fig. 1. a) Linear interpolation; b) «Intelligent» interpolation. Each sketch
shows the 1st line, the interpolated line and the 2nd line.]

«Intelligent» or «contour interpolation» is illustrated in Fig. 1b). If a
step function starts in the first line at $t_1$ and in the second line at $t_2$, the
interpolated signal is

$$s_{\mathrm{int}}\left(t - \tfrac{1}{2}(t_1 + t_2)\right) = \tfrac{1}{2}s_1(t - t_1) + \tfrac{1}{2}s_2(t - t_2).$$

That is to say the interpolated signal is again a step function which starts at
the interpolated time $\tfrac{1}{2}(t_1 + t_2)$ and has the mean amplitude of the steps $s_1$
and $s_2$. This is evidently the best guess in the case of outlines which are not
too capricious, and it is the correct guess if the outline is that of a uniformly
moving body.
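
The difference between the two rules can be checked numerically. The following is a minimal sketch and not part of the original paper: the integer sample grid, the unit amplitudes and the helper names `step`, `linear_interpolation` and `contour_interpolation` are all assumptions chosen for illustration.

```python
def step(times, start, amplitude):
    """Sampled step function: 0 before `start`, `amplitude` from `start` on."""
    return [amplitude if t >= start else 0.0 for t in times]


def linear_interpolation(line1, line2):
    """Ordinary interpolation: mean of the simultaneous amplitudes."""
    return [0.5 * a + 0.5 * b for a, b in zip(line1, line2)]


def contour_interpolation(times, t1, a1, t2, a2):
    """Contour rule for a single step: mean start time, mean amplitude."""
    return step(times, 0.5 * (t1 + t2), 0.5 * (a1 + a2))


times = list(range(10))
line1 = step(times, 2, 1.0)   # edge at t1 = 2 in the first line
line2 = step(times, 6, 1.0)   # edge at t2 = 6 in the second line

blurred = linear_interpolation(line1, line2)
sharp = contour_interpolation(times, 2, 1.0, 6, 1.0)
print(blurred)  # half-amplitude plateau between the two edges
print(sharp)    # one sharp edge at the mean position t = 4
```

With a slanting edge the linear rule produces a plateau of half amplitude between the two edge positions, whereas the contour rule keeps a single sharp step midway between them.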

The question is only how to find corresponding points ($x_1$, $x_2$ or $t_1$, $t_2$) in
two lines. One might think that it is necessary first to correlate the two lines,
and then to make two spots move along with uneven velocities, so that they 
arrive simultaneously at corresponding points, in which case one has only 
to take the mean of their simultaneous values. There is, however, a better
way, without any intermediate storage of data.

Fig. 2 illustrates this method, which may now be referred to as
«contour interpolation» in the case of interpolating between two lines of a
frame. The top sketch shows the contour, in reality a narrow zone in which
the amplitude changes steeply. It will be a [...] when a contour is sharp and
important enough to set the contour interpolating mechanism into action. (It is
certainly worth-while to make it act for the outlines of a head, but not for
every hair. If it is triggered off too easily, mistakes are more likely. This
is a matter for compromise.)

[Fig. 2.]

Assume now that the contour is important enough, e.g. that the gradient measured
over a set distance exceeds a certain absolute value. Up to this point the two
scanning spots have proceeded in one vertical line, and at equal speed $v$. When
one of them (in Fig. 2 the upper spot) reaches the contour, it stops dead, while
the second proceeds at double speed, $2v$, until this too reaches the contour. At
this moment the second spot stops dead and the first jumps up to double
speed. Subsequently the first spot is slowed down gradually, while the second
accelerates until the two spots have again aligned themselves vertically, and
proceed together with equal speed. Note that during this whole process their
mean velocity $\tfrac{1}{2}(v_1 + v_2)$ was constant and equal to $v$.

It is now not sufficient to take the simultaneous mean values of the signal
intensities; this would again produce a blurred outline. Instead we must adopt
the following rule of signal distribution between the two scanning spots: When
they proceed together, the amplitudes are added in equal proportions, i.e. one
half of $s_1$ is added to one half of $s_2$. At the instant when the first spot has
reached the contour line the whole signal is transferred to the one which is still
moving, i.e. we take now $s_{\mathrm{int}} = s_2$, and maintain this until the second spot
has also caught up with the contour. This procedure will give the correct
result if the area at the left of the contour is structureless, or if it has a
vertical structure, or, in the case of interpolation between frames, if this area
represents a stationary background of any type, and the contour is the outline
of a moving object. If these hypotheses fail, for instance if the contour is
that of a person moving before a background of flowery wallpaper, the error
is not important, because if the structure of the background was not
sufficiently pronounced for triggering off the contour mechanism, the eye will not
be very critical, and will concentrate on the sharp, moving outline.

As soon as the second spot, moving at double normal speed, has also caught 
up with the contour, it stops dead and the first spot starts moving with double 
speed. The amplitude distribution is now reversed, and the rule is simply 
that each spot contributes to the interpolated amplitude in proportion to its velocity. 
Hence the velocity plots in the diagram on the previous page represent at
the same time transmission or « weighting » functions for the two signal am- 
plitudes. 
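
The two-spot routine and the velocity-weighting rule can be sketched as a small simulation. This is only an illustrative sketch under stated assumptions, not the author's apparatus: the lines are lists of samples, a contour is detected as a jump larger than a hypothetical `threshold` between neighbouring samples, the two edges are an even number of samples apart, and the gradual return to equal speed is replaced by an abrupt one. The function name `contour_interpolate` is likewise invented for this example.

```python
def contour_interpolate(line1, line2, threshold=0.5):
    """Interpolate between two scan lines with two velocity-modulated spots."""
    lines = (line1, line2)
    n = len(line1)

    def at_contour(k, p):
        # Spot k at sample p sits just before a steep change of amplitude.
        return p + 1 < n and abs(lines[k][p + 1] - lines[k][p]) > threshold

    pos = [0, 0]                       # sample index of each spot
    out = [0.5 * (line1[0] + line2[0])]
    phase, waiting, skip_check = "together", None, False
    while True:
        if phase == "together" and not skip_check:
            hits = [at_contour(0, pos[0]), at_contour(1, pos[1])]
            if hits[0] != hits[1]:     # only one spot has met the contour
                waiting = 0 if hits[0] else 1
                phase = "search"
        skip_check = False
        if phase == "together":
            vel = [1, 1]               # both spots at normal speed v
        elif phase == "search":
            vel = [2, 2]               # the other spot runs at 2v ...
            vel[waiting] = 0           # ... while the first one stops dead
        else:                          # "catch_up": the roles are reversed
            vel = [0, 0]
            vel[waiting] = 2
        pos = [pos[0] + vel[0], pos[1] + vel[1]]
        if max(pos) >= n:
            break
        # Each spot contributes in proportion to its velocity (mean velocity 1).
        out.append(0.5 * (vel[0] * line1[pos[0]] + vel[1] * line2[pos[1]]))
        if phase == "search" and at_contour(1 - waiting, pos[1 - waiting]):
            phase = "catch_up"
        elif phase == "catch_up" and pos[0] == pos[1]:
            phase, skip_check = "together", True   # cross the contour together
    return out


line1 = [0.0] * 3 + [1.0] * 9   # edge between samples 2 and 3
line2 = [0.0] * 7 + [1.0] * 5   # edge between samples 6 and 7
interpolated = contour_interpolate(line1, line2)
print(interpolated)
```

In this toy case the interpolated line shows one sharp edge midway between the edges of the two input lines, and the mean position of the two spots advances by one sample per step throughout.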

But what if the second spot misses the contour? In this case, as illustrated
in Fig. 3, it is preferable to modify the routine a little. The control
mechanism is set in such a way that if within a distance $\Delta x_{\max}$ from the stop of
the first spot the second has not found the contour, the two velocities
automatically return to their standard values, but without an overshoot. The effect
will be as if the contour had terminated below the first spot, but this is not
as bad as if a spurious spot had appeared in the interpolated line at a distance
$\tfrac{1}{2}\Delta x_{\max}$ from the stopping point $x_1$.

It is a matter for experimental research to decide whether this routine
(without overshooting) shall not be adopted also in the case when the second
spot hits the contour. I see a certain advantage of the overshoot routine in
the point that it gives the correct amplitude in the contour itself (as at a
point in or very near to the contour the signals are divided 50-50), but this
may be a minor advantage.

Determining the distance $\Delta x_{\max}$ (or in the case of interpolation between
frames the corresponding time interval) is of course an important matter, where
a compromise must be struck. In the case of interpolation between lines of a
field we ought to make this distance, beyond which the second spot gives up
searching for a contour, equal to 6-8 point widths. This means that we
interpolate only contours which are inclined more steeply than 1:1 or 1:4 to the
horizontal. This is almost certainly sufficient, and the probability that there
will be a spurious contour in such a narrow interval is not great if we make the
critical (triggering) value of the intensity slope large enough.


In the case of interpolation between frames this is a more critical and
delicate matter. $\Delta x_{\max}$ now corresponds to the maximum distance by which
a contour has moved between two transmitted fields. In other words $\Delta x_{\max}$




sets an upper limit to the horizontal speed of moving objects for which our
method is still effective. If we leave out every second frame, the time interval
is 0.08 s; if we transmit one frame in four it is 0.16 s. Television producers
tend to avoid fast movements, and a person crossing the screen in two seconds
is probably the extreme case we need cater for. Taking the width of the
TV screen as equal to 530 picture points, this means that an object of maximum
speed moves 265 points in 1 s, 21 points in 0.08 s and 42 points in 0.16 s.
It is a matter for research to decide whether the chance of hooking on to
the wrong contour will not be too great if we make $\Delta x_{\max}$ = 20 or 40 points.
Perhaps it will be necessary to call in some further criteria if we want to
identify corresponding points at such large distances.
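
The arithmetic behind these figures can be reproduced in a few lines. This is a sketch only; the function name `max_displacement` and its parameter names are invented here, and the 25 frames/s rate is the European figure quoted earlier in the text.

```python
def max_displacement(width_points, crossing_time_s, frame_rate, keep_one_in):
    """Points travelled by the fastest object between two transmitted frames."""
    speed = width_points / crossing_time_s   # picture points per second
    interval = keep_one_in / frame_rate      # seconds between transmitted frames
    return speed * interval


every_second = max_displacement(530, 2.0, 25, 2)   # interval of 0.08 s
one_in_four = max_displacement(530, 2.0, 25, 4)    # interval of 0.16 s
print(round(every_second), round(one_in_four))
```

Rounded to whole picture points, the two results agree with the 21 and 42 points stated above.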

(Ball games will evidently not bear much compression; they take the present
standards of 50 fields per second almost to the limit. Moreover tennis
games are usually photographed from behind and above, so that the ball rises
or falls across 50-100 lines in less than a second, and horizontal interpolation
is of course unable to deal with this.)


2. — Functional elements of contour interpolation systems. 


It will be useful to list first the functional elements: 


1) Input store. Data which arrive in succession through the transmission
line must be available simultaneously. (E.g. two lines in one field, or
corresponding lines in two fields.) This necessitates some sort of storage organ.


2) At least two sensing organs, such as moving spots, with an organ which
decides when the contour mechanism shall be put into action. 


3) Velocity modulating organ, put into action when the previously men- 
tioned organ decides that a contour has been encountered. 


4) Signal distributor, which weights the data from the two sensing organs 
according to the scanning velocities, and adds them. 


5) Output store. This is necessary unless the interpolated lines, fields 
or frames are immediately transmitted. 


I propose testing the principle first in a rather slow device, suitable for
picture transmission, at a speed about 1/1000 of that of television, which
will be later described briefly. This is of interest in itself, because if it
works satisfactorily it could be associated with conventional picture
transmitters, such as used by newspapers for news photos, and would enable the
users to buy only half of the radio time, perhaps even less. In this
mechanical-optical device the input store is a photograph, containing only every
second line. The velocity modulator and the signal distributor are mechanical,
actuated by photocells. The output store is a film or photographic print.

The chief practical difficulty is in the stores. These will have to be either
storage tubes, or magnetic drum storage systems. Storage tubes are available,
but very expensive. Magnetic drum storage is very interesting, but will have
to be developed.

When starting on a venture like the present, it is good practice to think 
out the final consequences first, as far as possible, at least in general outlines, 
and fill in the details later. I will therefore describe first in very general terms 
the most ambitious system of which I dare think in advance; a television 
transmission system which gives a compression of 8:1 by contour interpo- 
lation alone (*). 


3. — A television transmission system with 8:1 compression.


The diagram shows a full cycle of the processing of a TV signal which
contains only one field in eight, stretched by some means (not to be discussed
here) so as to fill the whole time but only 1/8 of the normal waveband. In this
diagram a single field scan is represented as an unbroken straight line. The 
explanation is contained in the sketch at the left. The zones allotted to the 
single lines are shown side-by-side instead of on top of one another. A stored 
picture is represented by a shaded area. The second field, which is usually 
interlaced with the other is shown on top of the first field. (In fact there is 
no need to interlace these in the storage organs.) 

[Fig. 4. Horizontal axis: time.]

The three received fields are recorded in succession in the three storage
areas, which may be storage tubes or magnetic drums, and then the cycle starts
afresh. As soon as two lines of a field are stored, line interpolation starts as
indicated schematically in the bottom left diagram only. This again may be done
by means of storage tubes or drums, but these need not hold more than 3 lines
at a time. There are in every third of the full cycle two full stores, and the
between-frames interpolation is carried out between these two, four times in
every third of a full cycle. Both fields (the originally recorded and the
interpolated one) are scanned in sequence, so that the result of the
interpolation can be immediately transmitted (without output storage), ready for
reception by ordinary television receivers. For special purposes, such as cinema
or industrial television, it may be preferable to transmit the two fields
sequentially, and repeat every frame to avoid flicker. This can be done by
interlacing the two fields in the storage tube, and scanning the whole frame with
405 lines in the time taken usually for 202.5 lines. (This of course requires
twice the waveband, but there is no objection to this in «closed circuit» TV
systems.)

(*) This may then be combined with a band-compression system by equalization of
information rate, such as proposed by C. CHERRY and G. G. GOURIET, and F. SCHROTER.

[Fig. 5. — 8:1 compression system. Labels: storage; transmitted field;
transmission or output storage.]

The main application of compression systems is of course international, in
particular transatlantic, television, where the cost of terminal equipment is
negligible compared with that of the line. Here a compression factor of the
order of 20 appears desirable. Contour interpolation can contribute to this a
factor of 4-8, and equalization of information rate a further factor of 3-4,
hence the aim does not appear unattainable. Very high perfection is of course 
expected of every element in such an extensive reprocessing of the information. 
No noise or position error must be introduced by the repeated storing and 
reading of the pictures.


Intelligent Behavior in Problem-Solving Machines (*).


H. L. GELERNTER and N. ROCHESTER 


International Business Machines Co., Research Center - Yorktown Heights, N.Y. 


1. — Introduction. 


Modern machines execute giant tasks in arithmetic and carry out clerical 
operations that are far beyond human capacity, but we have not yet learned 
to apply them to problems that require more than a barest minimum of
ingenuity or resourcefulness. This paper reports some early results in an
approach to the problem of learning how to use machines in these presently
unmanageable areas. The goal of this research is the design of a machine whose
behavior exhibits more of the characteristics of human intelligence. 

We shall concern ourselves in particular with a single representative prob- 
lem; one which contains in relatively pure form the difficulties we must under- 
stand and overcome in order to attain our stated goal. The special case we 
have chosen is the proof of theorems in Euclidian plane geometry in the 
manner of, let us say, a high school sophomore. It must be emphasized that 
although plane geometry will yield to a decision algorithm, the proofs offered 
by the machine will not be of this nature. The methods to be developed will 
be no less valid for problem solving in systems where no such decision algo- 
rithm exists. 

Rejecting the application of a decision algorithm as uninteresting (in the
case of plane geometry) or impossible (for most problems of interest), there
remain two alternative approaches to the proof of theorems in formal systems.
The first consists in exhaustively developing the proof from the axioms and
hypotheses of the system by systematically applying the rules of transformation
until the required proof has been produced (the so-called «British Museum
algorithm» of NEWELL and SIMON (**)). There is ample evidence that this
procedure would require an impossibly large number of steps for all but the most
trivial theorems of the most trivial formal systems. The last remaining
alternative is to have the machine rely upon heuristic methods, as people usually
do under similar circumstances.

(*) This paper, presented at Varenna, had already been submitted by the authors in
June 1958 for publication in the IBM Journal of Research and Development, where
it appeared in Vol. II, No. 4, October 1958. (Editor's note.)

(**) A. NEWELL, J. C. SHAW and H. A. SIMON: Empirical Explorations of the Logic
Theory Machine, in Proceedings of the Western Joint Computer Conference (February 1957).

Problems for which people use heuristic methods seem to have the following 
characteristic. The work begins routinely, and then suddenly the person ex- 
periences a flash of understanding. This is followed by the writing down and 
checking of the solution. What seems to be happening is that the person first 
uses heuristic methods to look for a solution. To each suggestion turned up 
by the heuristic methods he applies some sort of a test. The flash of under- 
standing comes when some suggestion gets a high score on the test. The 
clerical task that follows is the transformation from «suggestion space» (*)
to «problem space ». The transformation is possible, of course, only if a valid 
solution has been indicated. This is what the geometry machine does. 

Instead of geometry we might have chosen a certain class of probability 
problems, proofs of theorems in projective geometry, proofs of trigonometric 
identities, proofs in part of number theory, or the evaluation of indefinite 
integrals. There were, however, compelling reasons for choosing plane geo- 
metry, the most important being the readily understood « suggestion space » 
offered by the diagram (the semantic interpretation of the formal system), 
and the ease of transforming « proof indications » into problem space. This 
will be considered in detail later in the paper. An important secondary reason 
was the fact that everyone who would be interested in our results has studied 
Euclid, so the results can be communicated more efficiently.

It should be noted here that the geometry project is a consequence of the 
Dartmouth Summer Research Project on Artificial Intelligence, standing on 
a foundation laid by the members of the study (*), and evolving from the 
pioneering work of NEWELL and SIMON in heuristic programming [1].

Not all problems whose solutions seem to be accompanied by a «flash of 
understanding » are elementary enough to lie within the scope of the methods 
described below. Many have difficulties of a more profound nature. It will 
be possible to say a little more about this later, but a secure understanding 
of the nature of these harder problems will come only after more research has 
been done. 

(*) A. NEWELL and H. A. SIMON have used the term «planning space».
(*) Particularly J. McCARTHY, M. L. MINSKY, and one of the authors (N.R.).

The explanation of the precise meaning of the term «heuristic method»
is an important part of this paper. For the moment, however, we shall
consider that a heuristic method (or a heuristic, to use the noun form) is a
procedure that may lead us by a quick shortcut to the goal we seek or it may
lead us down a blind alley. It is impossible to tell which until the heuristic
has been applied and the results checked by some formal process of reasoning.
If a method does not have the characteristic that it may lead us astray, we
would not call it a heuristic, but rather an algorithm (*). The reason for using
heuristics instead of algorithms is that they may lead us more quickly to our
goal and they allow us to venture by machine into areas where there are no
algorithms (**).

One final remark needs to be made. Since people seem to use heuristic
reasoning in nearly every intelligent act, it is reasonable to ask why some 
task more familiar and natural for people was not chosen as representative 
of the class rather than plane geometry. Several alternatives to geometry 
were, in fact, considered and rejected for failing to satisfy one or more of the 
following requirements: 


1) The task must include a kind of reasoning that we are not yet able 
to get our machines to do but about which we have ideas and think 
we can learn to manage. 


2) It must not contain harder kinds of reasoning that are too far beyond 
our understanding. 


3) It must not be cluttered with too much irrelevant work. 


Most human acts fail to meet requirement 2). We have a long way to go 
before our machines can play Turing’s «Imitation Game» and win [2]. 


(*) A decision procedure applied under the constraint of a time limit behaves as
if it were a heuristic.

(**) There are classes of problems, for example, proofs of theorems in number theory,
where it can be shown that no decision procedure can be devised. Heuristic procedures
should enable us to get machines to solve problems that are members of such classes.
It should be evident that no set of heuristics together with the programs to employ
them can guarantee that a machine will solve every member of such a class. All that
a machine can do is to probe around and perhaps come up with an answer. This,
of course, is all that people can do. It should be evident, too, that a program
utilizing heuristics can perfectly well be an algorithm that is guaranteed to solve any
member of some class of problems. Such a class must, of course, be amenable to a
decision procedure. The contribution of an individual heuristic here is that it may lead
to a short cut. The geometry theorem machine will probably be an algorithm of this
kind.


2. — Geometry.


A standard dictionary defines «geometry» as «the theory of space and of
figures in space», and indeed, most people would offer a similar definition.
To the mathematician, however, geometry represents a formal mathematical
system within which proofs are possible, and which can be related to real
space if this seems interesting for the purpose at hand, but which can
alternatively be related to concepts having no physical reality or significance. The
machine considers geometry primarily as a formal system but uses the
interpretation in terms of figures in space for heuristic purposes.

A formal system such as geometry comprises: 


1) Primitive symbols.
2) Rules of formation.
3) Well-formed formulas.
4) Axioms.
5) Rules of inference.
6) Theorems.


The set of primitive symbols (or alphabet) for geometry are those cha- 
racters which are interpreted as the names of points together with those inter- 
preted as specifying relations between discrete sets of points, or between a 
given set and the universe of points (e.g., =, |, 4, B, A). In order to make
proofs in geometry it is, for example, not necessary to think of a line as
something long, thin, and straight. It is sufficient to be able to recognize the
symbol «line».

The rules of formation specify how to assemble the primitive symbols into 
well formed formulas (statements) which may be valid or invalid within the 
formal system. For example, «Two sides of every triangle are parallel» is
a well formed formula (although not valid), whereas « Two exists of obtuse 
every one point» is not a well formed formula. We can ask the machine 
whether the first is true (interpreting formal validity as truth), but the second 
is gibberish because it does not obey the rules of formation. These rules are, 
in a sense, the grammar of a language whose vocabulary comprises the alphabet 
of primitive symbols. 

The axioms are a set of well formed formulas such as « Through every 
pair of points there can be drawn one and only one straight line » which are 
selected to serve as a foundation on which to build. They are regarded as 
being true by definition, if you like. 

The rules of inference are the means by which the validity of one well- 
formed formula can be derived from others that are already established. The 
new formula is said to be immediately inferred from the given one or set by 
the specified rule of inference. 

A proof is a succession of well-formed formulas in which each formula
(or line of proof) either follows by one of the rules of inference from the
preceding formulas, or is an axiom or previously established theorem. A theorem
is the last line in a proof.




To recapitulate, a problem presented to our machine is a statement in a
formal logistic system, and the solution to that problem will be a sequence
of statements each of which is a string of symbols in the alphabet of that
system. The last statement of the sequence will be the problem itself, the
first will always be an axiom or previously established theorem of the system (*).
Every other formula will be immediately inferable from some set preceding
it, or will itself be an axiom or previously established theorem.

This simple and elegant description of geometry is essentially the one given to
the high school sophomore. It will shortly be seen that this view is too naive 
to describe what really happens, but for the moment it will be expedient to 
continue the exposition as if it were true, because the idealization has a signi- 
ficance of its own. There are a number of things to be pointed out about this 
ideal view of geometry. For one thing, there is a difference between finding 
a proof and checking it. To check a proof one merely follows some simple 
rules that are set down very precisely. To discover a proof, on the other hand, 
requires ingenuity and imagination. One must use good intuitive judgement 
in selecting which of many possible alternatives is a step in the right direction. 
The high school sophomore does not have a complete set of explicit rules to 
guide him in finding a proof. 

Since the checking of a proof is a clerical procedure there is no reason why 
a machine cannot easily do it. A well-formed formula (i.e. axiom, line of a 
proof, or theorem) would be a string of data words in memory, and a rule of 
formation or of inference would be a subprogram. There is nothing really 
new or difficult about this, and many programs have been written to make 
machines do jobs as difficult. The artificial geometer will have a subprogram 
which is an algorithm for checking a proof. 
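
A proof checker of this clerical kind can be sketched in a few lines. The example below is an illustrative toy and not the authors' program: formulas are strings or tuples rather than data words, the only rule of inference is a hypothetical `modus_ponens` subprogram, and the axioms are placeholders with no geometric content.

```python
def modus_ponens(formula, established):
    """`formula` follows if some established p and ('implies', p, formula) exist."""
    return any(("implies", p, formula) in established for p in established)


def check_proof(proof, axioms, rules):
    """True if every line is an axiom or follows from earlier lines by a rule."""
    established = []
    for formula in proof:
        justified = formula in axioms or any(
            rule(formula, established) for rule in rules
        )
        if not justified:
            return False
        established.append(formula)
    return True


# Placeholder axioms of a toy system (assumed for illustration only).
axioms = {"A", ("implies", "A", "B"), ("implies", "B", "C")}

good_proof = ["A", ("implies", "A", "B"), "B", ("implies", "B", "C"), "C"]
bad_proof = ["A", "C"]   # "C" is asserted without the intermediate steps

print(check_proof(good_proof, axioms, [modus_ponens]))
print(check_proof(bad_proof, axioms, [modus_ponens]))
```

Checking only requires one pass over the lines of the proof; nothing in this routine helps to find the proof in the first place.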

The process of discovering a proof is another matter, and the question of 
how to get a machine to do it is the subject of this paper. The student or 
the machine can be given some useful hints, but must also be provided with 
a warning that these hints may be misleading. For example, it can be said 
that if the proposition to be proved involves parallel lines and equality of 
angles there is a good chance that it will help to try the theorem:

«If two parallel lines are intersected by a third line, the opposite interior 
angles are equal ». 

This advice is a heuristic that can be given to the machine or student. 


It will lead to a proof in a good many cases, but will as often lead 
nowhere at all. 


(*) In the case of a theorem contingent upon a set of hypotheses, the proof is
developed in an extended system in which the hypotheses are appended to the
original set of axioms. The transformation of this categorical proof to the
desired hypothetical one is trivial.

Thus far, there has been no mention of drawing figures. It is of course
quite possible to discover a proof in a formal system without interpreting that
system, and in the case of geometry, except for the need to discover proofs
efficiently, or for applying theorems to practical problems, one need never
make a drawing. The creative mathematician, however, generally finds his
most valuable insights into a problem by considering a model of the formal 
system in which the problem is couched. In the case of plane geometry, the 
model is a diagram, a semantic interpretation of the formal system in which, 
to quote Euclid, the symbol «point » stands for «that which has no parts », 
a «line » is « breadthless length », and so on. The model is so useful an aid 
for discovering proofs in geometry that few people would attempt a proof 
without first drawing a diagram, if not physically, then in view of the mind’s 
eye. If a calculated effort is made to avoid spurious coincidences, then one is 
usually safe in generalizing any statement in the formal system that correctly 
describes the diagram, with the notable exception of those statements con- 
cerning inequalities. 

We cannot emphasize too strongly the following point. To serve as a heu- 
ristic device in problem solving, it is not necessary that the model lie in 
rigorous one-to-one correspondence with the abstract system. It is only ne- 
cessary that they correspond in a sufficient number of ways to be useful. The 
success of the model in designating correct solutions to problems in that system 
(solutions that will be checked within the framework of the abstract system) 
is the only criterion one need apply in judging the suitability of a given 
model (*). 

If the model is indeed a semantic interpretation of a formal logistic system, 
then it is most desirable that the interpretation satisfy every axiom of the 
formal system. But should the interpretation be valid too for some richer 
formal system (or poorer one, for that matter), its heuristic value might be 
impaired, but by no means eliminated. 


(*) A. NEWELL and H. A. SIMON, in private communication with the authors, have
described an abstract model for a propositional calculus which is not a semantic
interpretation but which, in fact, is another formal system in which it is trivially
easy to prove the transformed theorems. Since this is a true heuristic, it is not
always possible to transform it back to the problem space.


3. — Heuristic methods.


The proof of theorems in Euclidian plane geometry in the sense described
above requires the extensive use of heuristic methods, and it is these methods
rather than geometry that are of primary interest to us. The role of
geometry is to provide a problem of the right difficulty to permit a thorough
development and understanding of the class of heuristics involved.

The steps in a typical application of a heuristic method to theorem proving
are the following:

1) Calculate the character (*) of the theorem.

2) Using the theorem character, calculate the applicable methods and
estimate the merit of each.

3) Select the most appropriate method.

4) Try it.

5) In case of failure, cross off this method and return to step 3).

6) In case of success, print the proof and stop.
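
The six steps above can be sketched as a short control loop. This is a minimal illustration under stated assumptions, not the authors' program: the function `prove`, the two toy methods, the dictionary used as a theorem character and the numerical merits are all invented for the example.

```python
def prove(theorem, compute_character, rate_methods, log=None):
    """Try methods in order of estimated merit until one yields a proof."""
    character = compute_character(theorem)      # step 1
    merits = dict(rate_methods(character))      # step 2: {method: merit}
    while merits:
        method = max(merits, key=merits.get)    # step 3
        if log is not None:
            log.append(method.__name__)
        proof = method(theorem)                 # step 4
        if proof is not None:
            return proof                        # step 6
        del merits[method]                      # step 5: cross off, try again
    return None                                 # every method has failed


# Hypothetical methods: each returns a proof (list of lines) or None.
def by_congruent_triangles(theorem):
    return None


def by_parallel_lines(theorem):
    return ["alternate interior angles are equal", theorem]


def toy_character(theorem):
    return {"mentions_parallel": "parallel" in theorem}


def toy_rating(character):
    bonus = 0.3 if character["mentions_parallel"] else 0.0
    return {by_congruent_triangles: 0.5, by_parallel_lines: 0.1 + bonus}


attempts = []
proof = prove("angles on parallel lines are equal", toy_character, toy_rating, attempts)
print(attempts)
print(proof)
```

In this toy run the method with the higher estimated merit is tried first, fails, is crossed off, and the second method then supplies the proof. The figures of merit are fixed here; the text envisages that the machine would later revise them in the light of its experience.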


The character of a theorem (or of any problem) is in essence the machine’s 
description of the theorem (or the problem). In its simplest form, the character 
may be represented by a vector, each element of which describes a given pro- 
perty of either the syntactic statement of the theorem or its semantic repre- 
sentation. The vector designating the applicable methods and estimated merit 
of each is a vector function of the character. The figures of merit are, of course, 
only guesses based initially on the experience of the programmer, and sub- 
sequently modified by the machine in the light of its experience. 

Defining the term characteristic as a given element of the character vector, 
the following might be introduced as syntactic characteristics of a theorem: 


$C_i$ = 1 if the hypotheses contain the symbol |, 0 otherwise;

$C_j$ = 1 if the consequents of the theorem contain the symbol |, 0
otherwise;

$C_k$ = 1 if there exists a permutation of the names of points in the
hypotheses that leaves the set of hypotheses unchanged, 0 otherwise;
and so on.

Examples of semantic characteristics are the following:

$C_l$ = n, where n is the number of axes of symmetry in the diagram;

$C_m$ = 1 if two angles or segments are to be proved equal, and they
are corresponding elements of congruent triangles, 0 otherwise;
and so on.
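
The syntactic characteristics lend themselves to a direct computation. The sketch below is an assumption-laden illustration, not the machine representation used by the authors: statements are tuples of a relation name followed by point names, the relation name `"parallel"` stands in for the parallelism symbol, and the symmetry test is purely textual and only counts permutations other than the identity.

```python
from itertools import permutations

PARALLEL = "parallel"   # assumed name for the parallelism relation


def rename(statement, mapping):
    """Apply a renaming of points to one statement (relation, point, ...)."""
    relation, *points = statement
    return (relation, *(mapping[p] for p in points))


def has_symmetry(hypotheses):
    """True if a non-identity renaming of points leaves the hypotheses unchanged."""
    points = sorted({p for _, *pts in hypotheses for p in pts})
    for perm in permutations(points):
        if list(perm) == points:
            continue                      # skip the identity permutation
        mapping = dict(zip(points, perm))
        if {rename(h, mapping) for h in hypotheses} == set(hypotheses):
            return True
    return False


def syntactic_character(hypotheses, consequents):
    """Vector of three syntactic characteristics, each 0 or 1."""
    return [
        int(any(h[0] == PARALLEL for h in hypotheses)),
        int(any(c[0] == PARALLEL for c in consequents)),
        int(has_symmetry(hypotheses)),
    ]


symmetric = {(PARALLEL, "A", "B", "C", "D"), (PARALLEL, "C", "D", "A", "B")}
asymmetric = {(PARALLEL, "A", "B", "C", "D")}
goal = {("equal_angles", "A", "B", "C", "B", "C", "D")}

print(syntactic_character(symmetric, goal))
print(syntactic_character(asymmetric, goal))
```

Trying every permutation is affordable only for the handful of points in a school theorem; a serious program would need a more economical symmetry test.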


(*) The term character was introduced by MINSKY (M. L. MINSKY, Heuristic Aspects
of the Artificial Intelligence Problem, Lincoln Laboratory Report 34-5, December 1956),
and is to be understood in its dictionary sense. The particular machine representation
of a theorem character selected by the authors differs somewhat from that of MINSKY,
but this important concept is due to him.

The rules formalized into the vector function that transforms the character
of a problem into a sequence of designated methods of approach and the esti- 
mated merit of each will in general fall into two categories. The first will 
contain those heuristics which operate on the syntactic characteristics of the 
problem. The second will, in the general case of a problem-solving machine, 
comprise those rules which operate on the characteristics of the model. For 
the artificial geometer, these are the semantic characteristics of the model as 
described above. 

The problem of strategy and tactics in choosing methods is most important. 
One obvious strategy mentioned earlier is to explore all alternatives syste- 
matically, and this is known to be inadequate for many problems and is con- 
sidered by the authors to be uninteresting and probably useless for geometry. 
The strategy and tactics used by NEWELL and SIMON in their achievement 
in theorem proving by machines are not adequate for this harder problem on 
present day machines. Their proofs were, at most, three or four steps long
and machine time required is probably an exponential function of the number 
of steps. Clearly the ten step proofs of geometry will require much more se- 
lective heuristics than those adequate for propositional calculus. 

The authors have at present a system of strategy and tactics. It does not 
seem useful to report it in detail at this time because machine experience will 
probably induce major revisions and improvements. It is clear, however, that 
the skill with which the machine selects and manipulates methods will dis- 
tinguish a good machine from a poor one. Since it is impossible to predict 
the detailed behavior of so complex an information processing system as the 
artificial geometer, it is necessary to write the program and run the simulation 
before conclusions can be reached with confidence. 

The speed with which a difficult problem can be solved is an essential 
factor in determining the usefulness of an intelligent machine. This speed 
cannot be achieved by little steps like inventing faster components. On the 
scale considered here a factor of ten is a minor change in speed. Suppose, for 
example, that a given proof requires ten steps. If for each step, the machine 
must explore three alternatives, there will be about 60000 things to consider
(3^10 = 59049). A slightly less intelligent machine that must explore six
alternatives will have to consider about 60000000 things (6^10 = 60466176).
For problems having longer solutions, selectiveness becomes exponentially
more important.
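The arithmetic behind this kind of estimate can be checked directly. The short sketch below assumes that every step offers the same number of alternatives and counts complete sequences of choices; the function name `sequences` is introduced here only for illustration.

```python
def sequences(alternatives: int, steps: int) -> int:
    """Number of distinct choice sequences when each step offers the same alternatives."""
    return alternatives ** steps


three_way = sequences(3, 10)    # three alternatives per step, ten steps
six_way = sequences(6, 10)      # six alternatives per step, ten steps
print(three_way, six_way, six_way // three_way)
```

Doubling the number of alternatives per step multiplies the work by 2^10 = 1024 for a ten-step proof, and by 2^n for an n-step proof, which is why a factor of ten in component speed counts as a minor change on this scale.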


4. — Syntactic symmetry. 


The formal system of plane geometry will be a difficult one for the machine 
to manipulate. Not only are the alphabet and axiom set both large, but geo- 
metry must be formalized in the lower functional calculus, at the very least. 


31 — Supplemento al Nuovo Cimento.

The difficulty is compounded, too, by the fact that the predicates of plane
geometry exhibit a high degree of symmetry, and a given statement in the
system will in general admit a multiplicity of completely equivalent forms.
These symmetries are at times a painful thing to contend with; they make
it necessary that a theorem be considered in every one of its equivalent forms
in seeking to establish a deduction by means of substitution. On the other
hand, they are the basis of a powerful new rule, completely syntactic in nature,
that simplifies immensely the search for a proof of a theorem displaying these
symmetries. The rule will prevent the machine from searching in a circle
for useful intermediate steps, or subgoals, to bridge the gap between
antecedent and consequent of the theorem to be proved. In effect, it removes
from consideration those subgoals which are formally equivalent to some
subgoal that has already been incorporated into the structure of the search
for a proof.

[Fig. 1. — Parallelogram with vertices A, B, C, D and point E.]

We shall introduce the rule by an example. Let us consider the following
theorem: The diagonals of a parallelogram bisect one another (Fig. 1).

To solve the problem, the machine must establish the formulas:

    AE = EC,
and
    BE = ED.


Now it would be most useful if the artificial geometer could recognize, as
people usually do, that the proof of the second formula is essentially the same
as that for the first, and therefore only one of the two need be established.
But it is even more important that the machine fall not into the class of trap
illustrated by the following redundant search process. The method chosen is
that of congruent triangles, and in order to establish the formula △I ≅ △II,
from which the theorem may be immediately inferred, the machine sets at
some later stage the subgoal △III ≅ △IV. The geometer will, in fact, satisfy
our requirements on both these points. The mechanism whereby this is
accomplished is an embodiment of the theorem and rule specified below.

Consider first the following definition: Let π be a permutation on the
names of the syntactic variables in a theorem. Then π is a syntactic symmetry
of the theorem if its operation on the set of hypotheses leaves the set unchanged
except for a possible transformation into an equivalent form with respect to
the symmetries of the predicates (i.e. π{H} = {H} is valid). We can now state
the required theorem thus:

If Γ is a well-formed formula provable from the set of hypotheses {H},
then πΓ is a well-formed formula provable from the same set {H}. The
formula πΓ will be called the syntactic conjugate of Γ. The proof of the theorem
is quite trivial, and follows from the fact that the syntactic variables in a
theorem may be renamed without destroying the validity of the theorem.

    Thus, if        {H} ⊃ Γ       is valid,
    then            π{H} ⊃ πΓ     follows by the rule of substitution.
    Since           π{H} = {H},
                    {H} ⊃ πΓ      is valid.

The theorem itself grants the machine the same power the human mathematician
has at his disposal when he recognizes the equivalence of two different
statements with respect to a given formal system, for now it may
establish the syntactic conjugate of any valid formula Γ by merely asserting
« similarly πΓ ». The rule of syntactic symmetry follows from the theorem.
It is used by the machine to construct, given the heuristics and methods at
its disposal, the optimum problem-solving graph, and a description of such a
graph is in order at this point.

Let G_0 be the formal statement to be established by the proof. It will be
called the problem goal. If G_i is a formal statement with the property that
G_{i-1} may be immediately inferred from G_i, then G_i is said to be a subgoal
of order i for the problem. All G_j such that j < i are higher subgoals than
G_i, where G_0 is considered to be a subgoal of order zero. The problem solving
graph has as nodes the G_i, with each G_i joined to at least one G_{i-1} by a
directed link. Each link represents a given transformation from G_i to G_{i-1}.
The problem is solved when any G_i can be immediately inferred from the
hypotheses and axioms (*).

[Fig. 2. — The nodes G_i^α represent subgoals of order i, with α numbering
the subgoals of a given order; P is a transformation on G_i^α into G_{i-1}^β.]


(*) The completed proof will use a deduction metatheorem to get ⊢ {H} ⊃ G_0
from {H} ⊢ G_0.

We can now specify the rule of syntactic symmetry thus: G_i is not a
suitable subgoal to add to the problem solving graph if it is the syntactic
conjugate of any G_j for i ≥ j, for any proof sequence leading to G_i is identical
with a conjugate sequence leading to G_j with the variables renamed, and any
mechanism leading to a proof of G_i would as well prove G_j. If i = j, the two
subgoals are in effect redundant, and if i > j, the sequence leading to G_i
leads to G_j when conjugated, and all the steps G_k, i ≥ k > j can be eliminated.

In the light of the above, we may now re-examine our introductory problem
(Fig. 1). The machine must establish the following two goals:

    G_0^1: AE = EC,
    G_0^2: BE = ED.

By the theorem of syntactic symmetry, the machine will eliminate G_0^2 from
the graph, since G_0^2 = πG_0^1, where π is the transformation A into B, B into C,
C into D, and D into A, and after proving G_0^1, will assert « similarly, G_0^2 ».
Then, if at some point in the proof △ABE ≅ △CED is a subgoal, it will eliminate
the statement △BCE ≅ △DEA as a possible subgoal; if AB = CD is a
subgoal, BC = DA will be removed from consideration. Clearly every directed
path through the problem solving graph from hypotheses to goal will
be unique under the π transformation, and will be the shortest one in that
it will contain no redundant sub-graphs (no two nodes will be linkable by a
π transformation).
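The rule of syntactic symmetry can be illustrated on the parallelogram example. The sketch below is a minimal one under explicit assumptions: the encoding of predicates as tagged tuples over unordered pairs, the helper names (`seg`, `parallel`, `equal`, `between`, `rename`, `syntactic_symmetries`, `is_redundant`), and the choice of hypotheses (opposite sides parallel, E on both diagonals) are introduced here for illustration and are not the authors' machine representation. It also searches all renamings of the point names rather than a single permutation.

```python
from itertools import permutations

POINTS = ("A", "B", "C", "D", "E")


def seg(p, q):
    """An undirected segment: PQ and QP are the same object."""
    return frozenset((p, q))


def parallel(s, t):
    return ("parallel", frozenset((s, t)))       # symmetric predicate


def equal(s, t):
    return ("equal", frozenset((s, t)))          # symmetric predicate


def between(p, m, q):
    return ("between", m, frozenset((p, q)))     # m lies between p and q


def rename(term, pi):
    """Apply the renaming pi to every point name inside a term."""
    if isinstance(term, str):
        return pi.get(term, term)                # predicate tags are left alone
    if isinstance(term, frozenset):
        return frozenset(rename(x, pi) for x in term)
    return tuple(rename(x, pi) for x in term)


def syntactic_symmetries(hypotheses):
    """All permutations of the point names that map the hypothesis set onto itself."""
    found = []
    for image in permutations(POINTS):
        pi = dict(zip(POINTS, image))
        if {rename(h, pi) for h in hypotheses} == hypotheses:
            found.append(pi)
    return found


def is_redundant(goal, accepted, symmetries):
    """True if goal is the syntactic conjugate of an already accepted subgoal."""
    return any(rename(g, pi) == goal for g in accepted for pi in symmetries)


H = {
    parallel(seg("A", "B"), seg("C", "D")),
    parallel(seg("B", "C"), seg("D", "A")),
    between("A", "E", "C"),
    between("B", "E", "D"),
}
SYMMETRIES = syntactic_symmetries(H)
ROTATION = {"A": "B", "B": "C", "C": "D", "D": "A", "E": "E"}

G1 = equal(seg("A", "E"), seg("E", "C"))
G2 = equal(seg("B", "E"), seg("E", "D"))
print(ROTATION in SYMMETRIES, is_redundant(G2, [G1], SYMMETRIES))
```

Because the symmetric predicates are stored as unordered pairs, "equivalent form with respect to the symmetries of the predicates" reduces to plain set equality, and a candidate subgoal is rejected as soon as some symmetry carries an accepted subgoal onto it.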

Syntactic rules such as the above will be essential to the success of the 
plane geometry machine. But while they ease the labor of the geometer con- 
siderably as it threads a path from problem to solution, they are, except in 
the simplest cases, powerless to indicate which path, among the very many 
possible, does indeed lead to a solution, and which wander off into infinity, 
regressing farther from the goal with each step. The geometer will need more 
information about most problems before it can even begin to seek a solution. 
It will find the information as the mathematician does, in the diagram.


5. — Semantic heuristics. 


Semantic heuristics are concerned with the body of pertinent and probably
true statements that can be obtained by observing the diagram. For example,
one of the first such rules to be applied by the geometer in a particular case
will be the following:


If the diagram consists of a « bare » simple polygon, a construction will
probably be required.

A rule to indicate which construction to make might be:
If the figure has one axis of symmetry, and it is not drawn, then 
draw it. 


A most useful rule will be: 
If the theorem asks that two line segments or angles be proved equal, 
determine by measuring whether these are corresponding parts of 
apparently congruent triangles. If so, attempt to prove the con- 
gruence. 
If necessary, draw lines connecting existing points in the diagram in 
order to create the congruent triangles. 


Another frequently used heuristic will be: 
If two apparently parallel lines are crossed by a transversal, attempt 
to establish the parallelism by considering the angles. 
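The second of these rules, "determine by measuring", lends itself to a small illustration. The sketch below assumes a diagram stored as a table of point coordinates and treats two triangles as apparently congruent when their corresponding sides agree within a tolerance; the names `DIAGRAM`, `side_lengths` and `apparently_congruent`, the coordinates, and the tolerance are all assumptions made for the example.

```python
import math

# A parallelogram ABCD drawn with concrete coordinates; E is where the diagonals cross.
DIAGRAM = {
    "A": (0.0, 0.0),
    "B": (1.0, 2.0),
    "C": (5.0, 2.0),
    "D": (4.0, 0.0),
    "E": (2.5, 1.0),
}


def side_lengths(triangle, coords):
    """Lengths of the sides PQ, QR, RP of the triangle named 'PQR'."""
    p, q, r = (coords[name] for name in triangle)
    return (math.dist(p, q), math.dist(q, r), math.dist(r, p))


def apparently_congruent(t1, t2, coords, tol=1e-6):
    """Compare corresponding sides, in the order in which the vertices are named."""
    pairs = zip(side_lengths(t1, coords), side_lengths(t2, coords))
    return all(math.isclose(x, y, abs_tol=tol) for x, y in pairs)


print(apparently_congruent("ABE", "CDE", DIAGRAM))   # candidates worth trying to prove
print(apparently_congruent("ABE", "BCE", DIAGRAM))   # not worth pursuing in this figure
```

A positive measurement proves nothing by itself; it only tells the heuristic part which congruence is worth submitting to formal proof.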


A more complete understanding and appraisal of the appropriate heuristics 
will be one of the major consequences of experimentation. 

It should be clear that the best set of heuristic rules, in other words the 
set that is the best compromise between conciseness and efficiency, should 
not be expected to yield the best proof in every case. Indeed, in a number 
of awkward cases the rules will impede, rather than aid, the search for a con- 
cise proof. In some cases the machine will make a construction and produce 
an elaborate proof while missing a simple elegant proof. People, too, do this. 
But these awkward cases should be the exception, and the heuristic rules look 
powerful enough to make an efficient machine. 


6. — Rigor. 


Mathematical rigor becomes a significant matter in two different aspects 
of the artificial geometer. One of these is that machines can provide, in a 
sense, more rigorous proofs than have hitherto been available. More impor- 
tant than this is the second aspect, and this is that the machine is like a good 
human mathematician, in that it increases its output and improves its com- 
munication with other mathematicians by taking chances with rigor. 

Axioms and theorems are objects that can be examined and manipulated 
by people and machines. These present no problem. However, methods of 
inference are instructions to do something. In the case of machines they are 
programs of instructions in machine language. In the case of people they are 
instructions expressed in a natural language and intended to control human 
behavior. Except for undetected blunders in design of a machine or in the 
writing of a set of machine instructions, the machine and its instructions are 
fully understood. And when one of these blunders is detected, it causes merely
annoyance and not bewilderment. Therefore when a machine proves a theorem,
there is, in principle, no doubt about what is going on, and except for possible
apprehensions about human blunders or undetected machine malfunctions
there is no doubt about rigor.

In contrast the human situation is rather poor. While much is known 
about the human brain, the basic principles of operation are still unknown. 
Perhaps some of today’s conjectures are correct, but we have no sure way to 
select the correct conjectures from among the various contradictory proposals. 
Furthermore, natural languages are not yet perfectly understood and again 
there are contradictory theories. It therefore seems unwise to rely on the 
rigor of any system based on such a welter of ignorance. 

It is interesting to observe that the most rigorous treatments of the
foundations of mathematics seem equivalent to designing a machine and a machine
language and henceforth communicating in this language. In one case [3]
the mathematician even uses the term « machine », although his machines could
not actually be built because they contain parts with infinite dimensions.
Other really good treatments do not use the word « machine » but are
essentially equivalent. It should be clear then that the translation of a
formal system into a machine program is reasonable and natural.

The other aspect of rigor is quite different. Most elementary textbooks on
geometry fail to prove betweenness relations. In Fig. 3, the acute angle ABC
is bisected by the line segment BD. The line segments BD and AC are extended
to infinity, thus becoming lines BD and AC. Point E is defined as the
intersection of lines BD and AC. Now how can it be determined whether point E
lies between A and C or to the left of A or to the right of C?

[Fig. 3. — Bisected angle.]

Ordinarily this decision is made by looking at the figure. In rigorous
treatments it is proven formally, but this is a tedious effort and except for
metamathematical considerations, not really necessary. Expediency dictates that
the mathematician should neglect the possibility that semantic heuristics will
lead him astray and get on with the work rather than dally over proofs of
betweenness. Because people rarely get in trouble because of honest errors
of this kind, traditional geometry excludes proofs of betweenness, and most
mathematics appear to lack rigor because many matters are settled by heuristic
methods rather than formal proofs. It seems clear that the machine must be
able to work this way if it is to become proficient.

The artificial geometer decides questions of betweenness by measurements 
on the figures. But whenever it does so, it explicitly records the necessary 
assumptions for a given proof so as to leave a record of its guesses. There
is, of course, a danger that the machine will be proving only a special instance
of the theorem presented to it, but this danger can be minimized by having
the machine draw alternate diagrams to test the generality of its assumptions
when they are necessary.
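For the bisected angle of Fig. 3, deciding betweenness by measurement and then testing the assumption on alternate diagrams might look as follows. This is a sketch under assumptions: the function names, the random sampling of triangles, the number of trials and the degeneracy threshold are illustrative choices, not a description of the diagram computer.

```python
import math
import random


def unit(v):
    length = math.hypot(v[0], v[1])
    return (v[0] / length, v[1] / length)


def cross(u, v):
    return u[0] * v[1] - u[1] * v[0]


def bisector_parameter(a, b, c):
    """Return t such that E = A + t (C - A) lies on the bisector of angle ABC."""
    ba = unit((a[0] - b[0], a[1] - b[1]))
    bc = unit((c[0] - b[0], c[1] - b[1]))
    d = (ba[0] + bc[0], ba[1] + bc[1])           # direction of the bisector from B
    w = (c[0] - a[0], c[1] - a[1])               # direction of line AC
    return cross((b[0] - a[0], b[1] - a[1]), d) / cross(w, d)


def e_between_a_and_c(trials=1000, seed=0):
    """Measure many alternate diagrams; report whether E was always between A and C."""
    rng = random.Random(seed)
    checked = 0
    while checked < trials:
        a, b, c = [(rng.uniform(-10, 10), rng.uniform(-10, 10)) for _ in range(3)]
        area = abs(cross((b[0] - a[0], b[1] - a[1]), (c[0] - a[0], c[1] - a[1]))) / 2
        if area < 0.1:                           # skip nearly degenerate drawings
            continue
        if not 0.0 < bisector_parameter(a, b, c) < 1.0:
            return False                         # the recorded assumption failed
        checked += 1
    return True


print(bisector_parameter((3.0, 0.0), (0.0, 0.0), (0.0, 4.0)))   # position of E along AC
print(e_between_a_and_c())
```

E lies between A and C exactly when the parameter t falls strictly between 0 and 1. A single diagram settles the question for that drawing only; repeating the measurement on many drawings is what guards against proving a special instance.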


7. — Programming the geometer. 


The organization of the program falls naturally into three parts: a « syntax
computer » and a « diagram computer » embedded in an executive routine, the
« heuristic computer ». The flow of control is indicated in Fig. 4.

[Fig. 4. — Flow chart of the artificial geometer.]

The syntax computer contains the formal system and its purpose is to
establish the formal proof. Geometry is expressed in a Post-Rosenbloom
canonical language which should be useful for a much wider range of formal
systems than geometry. The heuristic computer can submit any sequence of
lines of proof to the syntax computer which will test them to see if they
are correct.

The diagram computer makes constructions and measures them. It does 
this by means of geometry and floating point calculations. However, it keeps 
all this secret from the heuristic computer and reports only qualitative infor- 
mation of the type acquired by a mathematician in scanning a well-drawn 
figure. The behavior of the heuristic computer and the syntax computer would 
not be changed if the diagram computer were replaced by a machine that 
could draw diagrams on paper and observe them. 

The heuristic computer does most of the things that have been discussed 
in this report. It contains the heuristic rules and decides what to do next. 
The subordinate computers only follow its instructions and answer its questions.

The program is being written in an information processing language con- 
structed by appending a large set of special functions to the Fortran compiler 
for the IBM 704. The language increases manyfold the ease of writing prog- 
rams of the nature of the geometer, and will be reported upon in detail in a 
subsequent paper. 


8. — Learning in intelligent machines. 


The machine described thus far will exhibit intelligent behavior, but it 
will not improve its technique. Except for the annexation of previously proved
theorems to its axiom list, its structure is static. A rigorous sequence of
practice problems will not improve its performance at all in solving a given
problem unless a usable theorem is among them. Such a machine, incapable 
of developing its own structure will always be limited in the class of problems 
it can solve by the initial intent of its designer. It seems that the problem 
of designing a machine of general intelligence will be enormously greater, if 
at all possible, than designing a not so intelligent one with the capacity to 
learn. 

One might attempt to endow an automaton like the geometer with the 
ability to learn at various levels of sophistication. Indeed, the behavior of the 
machine in storing away for future use each theorem it has proved may be 
interpreted as learning of a rudimentary sort. This might be refined by having 
the machine become selective in its choice of theorems for permanent storage, 
rejecting those which do not seem (by some well-defined criteria) to be suffi- 
ciently «interesting » or general to be useful later on. Similarly, instead of 
« forgetting » all lemmas it might have established as intermediate steps in 
the proof of the theorems offered to it, to be rederived when needed, the ma- 
chine might select the especially interesting ones for its list of established 
theorems. 

The next level of learning is indicated when the machine adjusts, on the 
basis of its experience, the probability for success it assigns to a given heuristic
rule for a theorem with a given character. This is the learning involved when 
the machine uses results on one problem to improve its guesses about similar 
problems. As the geometer is given problems of a given class, say problems 
about parallelograms, it would get better at handling them. After it had been 
given a graded sequence of harder and harder problems, its performance should 
be much better and it could be said to have learned to prove parallelogram 
theorems. The highest level to which we aspire for an early model geometer 
will be involved when it looks over the quality of its predictions and discards 
as irrelevant some of the criteria that comprise the problem character. The 
earliest models of the geometer will include only low levels; later models will 
be better.
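The second level of learning, adjusting the probability of success assigned to a method for a given character, could be realized in many ways. The sketch below shows one simple possibility and is an assumption throughout: the class `MeritTable`, its method names, and the smoothed success ratio (successes + 1) / (trials + 2) are chosen for illustration and are not reported in the text.

```python
from collections import defaultdict


class MeritTable:
    """Success estimates per (character, method) pair, revised after every attempt."""

    def __init__(self):
        # (character, method) -> [successes, trials]
        self._counts = defaultdict(lambda: [0, 0])

    def record(self, character, method, succeeded):
        entry = self._counts[(character, method)]
        entry[0] += int(succeeded)
        entry[1] += 1

    def estimate(self, character, method):
        successes, trials = self._counts.get((character, method), (0, 0))
        return (successes + 1) / (trials + 2)    # 0.5 when nothing is known yet


table = MeritTable()
parallelogram = (1, 1)                           # a toy character vector
before = table.estimate(parallelogram, "congruent_triangles")
for outcome in (True, True, False, True):
    table.record(parallelogram, "congruent_triangles", outcome)
after = table.estimate(parallelogram, "congruent_triangles")
print(before, after)
```

In such a scheme the programmer's initial guesses would replace the uniform starting value, and a graded sequence of parallelogram problems would gradually raise the estimate for the methods that keep succeeding on that character.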

Beyond these kinds of learning we can see other things. Before we come 
to them, however, we will probably be working on machines to solve harder 
problems than those of geometry. There are kinds of learning that are needed 
only by machines that take their environment more seriously than theorem 
proving machines do. These will be discussed in the next section. But we 
can hope that a theorem machine might some day be able to observe that 
some sequence of methods was effective in certain circumstances, and con- 
sequently streamline the sequence into a single method and in this way devise 
a new method. 

But in still another vein there are possibilities for theorem machines. Instead
of providing a machine with a formal system and a sequence of propositions
to prove, it could be given a formal system and be asked to see what it could
find. Here it would at least need criteria for the utility of theorems in proving
other theorems and for the elegance of a proof in terms of large achievements
in a small number of steps. New kinds of learning would be used here.

Before closing the subject of learning machines, there are some further 
considerations to deal with. A computer is, after all, just a finite automaton, 
and, as such, its behavior is completely determined by its internal state at 
the beginning and subsequent input information. This being the case, it can 
be argued that its response to any set of input signals is in principle predictable 
and is consequently uninteresting and not worthy of the description « intel- 
ligent ». Another version of this objection is the following. The machine, 
endowed with heuristics and judgements of its designer, is but a trivial ex- 
tension of that person, in principle no different from a slide rule in the hands 
of an engineer. 

From a certain irrelevant point of view the objection is justified, but in 
practice the behavior of the machine is far from being predictable. That this 
is indeed the case is well illustrated by the fact that the geometer, its ope- 
ration simulated by «hand », has on several occasions produced a proof that 
was a complete surprise to its programmers. The nature of an intelligent 
program is such that unlike a conventional arithmetic computation, in which 
the branches are few and easily traceable, the number of conditional branches 
depending on the input are bewilderingly many and highly interdependent, 
rendering impossible any detailed attempt to trace its behavior. And of course, 
once learning is introduced into the program, it will constantly modify itself 
in a highly complex way, so that while its behavior is still in principle deter- 
mined, one will become increasingly powerless to predict its response in any 
given case. In a very real sense, the machine’s proofs will be no more or less 
trivial than those offered by the neophyte mathematician who is still under 
the influence of his professor. 

One may view this machine in still another way. At any instant of time, 
the internal configuration of our machine is some particular state of a finite 
state automaton. Then of the infinite number of sequences that one might
ask the machine to establish as theorems, some infinite subset of these will 
be provable by it. At any given time, our machine represents a partial de- 
cision method over this infinite set of theorems, and this set will be richer in 
« interesting » theorems than a random subset of all theorems. The class of 
theorems considered «interesting » will determine the heuristics that control 
the partial decision method, and in turn, the density of interesting theorems 
in the set enumerated by the machine will depend on the apt choice of the 
heuristics. It is important to note that if even the most rudimentary learning 
behavior is built into the machine, its initial internal configuration will be 
different for each new problem presented to it, and consequently, the class
of theorems decidable by the machine will be continually changing. And what
is any human mathematician but a partial decision machine over some unknown 
class of theorems? 

It is possible to approach the problem of theorem proving by machine from 
a rather different direction. E. W. BETH describes a method (« semantic 
tableaux ») for systematically constructing a counter-example for a proposed 
theorem if there is one, or else establishing the fact that none exists [4]. If 
it can be shown that a counter example cannot be constructed, an algorithm 
is given for converting the « closed » semantic tableaux produced into a proof 
of the theorem in the formal system. But the method of semantic tableaux 
is essentially an enumeration procedure—in this case, it is the set of individual 
instances of the theorem that could possibly be counter examples to the 
theorem that is being enumerated, and like all such procedures, the bulk of 
calculation required rapidly outdistances the capacity of conceivable com- 
puting machines. In order to make the procedure reasonably efficient, heuristic 
rules for the control of the enumeration must be introduced, and one is faced 
with essentially the same problem that concerns the body of this paper. The 
more or less anthropomorphic approach followed by the authors has the advantage
that suitable heuristics are readily suggested by introspection and
the methods developed are more likely to be applicable to the solution of 
problems in non-formal systems. 


9. — The theory machine. 


At various points in the preceding discussion, a line of reasoning was ter- 
minated by the comment that harder problems exist but they are outside the 
scope of the matter being considered. This large new class of problems and 
how a machine can handle them is the subject of this section. We consider 
now a machine that takes its environment more seriously. 

The subject will be introduced by an example of a more advanced kind 
of geometry machine, a machine that tries to learn what kind of geometry 
fits the environment it finds around it. The heuristic computer is provided 
with an environment by the diagram computer. It looks to the environment 
for heuristics, for clues about what to do next. However, if it learns that some
measurement contradicts something that it can prove in the syntax computer, 
it assumes that the measurement is in error. In other words the formal system 
is sacrosanct. 

Now suppose that the diagram computer is replaced with another that 
does its drawings on the surface of a sphere. Suppose further that the prio- 
rities in the heuristic computer are readjusted so that it believes the diagram 
computer rather than the syntax computer when the two are in conflict. Suppose
also that it is provided with the means to modify the formal system and
additional heuristics to enable it to do so efficiently. It would be arranged so 
as to try to bring theory (the syntax computer) and experiment (the diagram 
computer) into harmony, and thereby discover what kind of a world it lives in. 
This is a theory machine. 

There seems to be, in principle, no reason why a theory machine should 
not be fitted with the means to do experimentation, a tool room, a stockroom, 
and an instrument room, and told to work out the theory of something or 
other. In practice, there is the familiar difficulty of speed and cost. Today 
it is cheaper and quicker to use people to do research, but perhaps someday 
machines will do the research and people will merely control the doing of 
research. This is precisely parallel to the digging of excavations. Once people 
did it, but now machines do the digging and people merely control the ma- 
chines. The scientist using a machine to do research would have a role ana- 
logous to that of a professor at a university directing his graduate students. 

A further conjecture along this line relates to programming. A person 
finds it much easier to communicate a complex message to another person 
than to a machine. Speaking is relaxed and easy while writing a program of 
machine instructions is detailed and exacting. When one person listens to 
another he often fails to interpret some word correctly for a while but later 
some other words enable him to understand the earlier word. It seems as if 
the listener is continually generating hypotheses about what the speaker means 
and is continually checking these hypotheses and accepting them or rejecting 
them and casting about for others. In terms of human activity, theorizing 
is much too pretentious a word for this activity. However, from the point 
of view of machine design, it may be that only a theory machine will be easy 
for people to instruct. 

The interaction between formal and heuristic procedures in a theory ma- 
chine is more intricate than in a theorem machine. To determine the conse- 
quences of its present hypothesis the theory machine must use the methods 
of the theorem machine. Because of the different nature of the typical prob- 
lems it will be solving, the theory machine must lean more heavily on semantic 
heuristic as a substitute for rigorous deduction. Then when it finds a discre- 
pancy between theory and experiment it must use both rigorous deduction 
and heuristic procedures to modify its formal system. It is an interesting 
feature of such a machine that the rules for formal deduction used to modify 
the formal system are actually part of the formal system. This is not an un- 
reasonable situation; it is essentially what happens when the program for a 
calculator causes the calculator to modify the program. However, it surely 
is complicated, and the complication does not end here. 

The machine described so far resembles a theoretician with little or no 
experimental skill. Additional heuristic is required to enable the machine to
select a clean experiment that will be an effective test of a theory. Contin-
gencies will arise in the experimentation, and the machine must handle these 
as subproblems. In other words it must invoke this whole apparatus over 
again at a lower level. 

The theory machine is a device that conjectures about its environment 
and tests its conjectures. In so doing it gains an increased understanding of 
what is going on. It is hoped that not only will the theory machine be able 
to do research, but will also be easier to communicate with than a present 
day automatic calculator. 


10. — Summary. 


In contrast to the present use of automatic calculators which outperform 
humans in clerical tasks, the theorem machine is advanced as a device that 
reasons heuristically. It is therefore able to solve harder problems, and the 
study of it reveals some things about the nature of problems and of machines. 
The essential operating principle of this kind of artificial intelligence is that 
it has a formal part, a syntax computer that can make deductions, and a 
heuristic part that can make guesses. By using the syntax computer to test 
the guesses made on a heuristic basis, the machine is able to get results that 
are beyond the scope of a purely deductive machine. 

Heuristic processes can be syntactic, that is depending on the language 
in which the problem is stated, and on the statement in that language, or 
they can be semantic and depend upon an interpretation or model of the 
formal system, in other words, an example. 

The artificial geometer is an example of a theorem machine. Geometry 
was chosen, not because of any inherent interest, but rather because it pro- 
vides an example of a problem at the right level of difficulty that needs se- 
mantic heuristic in a major way. It is being pursued by simulation on the 
Type 704 Electronic Data Processing Machine.

An interesting aspect of geometry is that as taught in high school, it is 
not rigorous. Some facts are established by proving them and some by
observing the figure (i.e. semantic heuristics). This is a powerful, effective method
of reasoning used by people and by the artificial geometer. While it would 
be possible, and probably easier, to make the artificial geometer perfectly 
rigorous, it is more significant in the study of artificial intelligence to avoid 
the strictness of rigor that is a proper part of metamathematics but not
efficient in mathematics.

Beyond the theorem machine is the theory machine which, by conjecturing 
and testing the conjectures, gains an understanding of its environment. Such a 
machine should be able to do research and should be easier to communicate with. 

The largest obstacle to the development of useful theorem and theory
machines is the problem of speed. This cannot be cured by faster components 
alone. The major contribution to speed must come from improved heuristic 
so that the machine will waste less time in fruitless endeavor. The nature of 
hard problems ensures that the machine must waste some time on wrong hunches,
but the waste must be kept within bounds. The machines themselves are 
expected to make a major contribution to the understanding of artificial intel- 
ligence because they learn as they work, and what they learn reveals much. 


In closing the authors wish to acknowledge the contributions of A. NEWELL,
J. McCARTHY, M. L. MINSKY, and H. A. SIMON, whose relation to the project
has been indicated in the text; and of C. L. GERBERICH, J. R. HANSEN, and
R. M. KRAUSE, whose technical and programming contributions are making
the project possible. Professor McCARTHY in particular has been playing a
continuous role as consultant to the authors.


Introduction.


We begin our consideration of the linguist’s approach to the problem of 
speech communication by inquiring into the nature of the data that constitute 
the subject matter of linguistics. We want to know what kind of problems 
are of special interest to the linguist, for only if we understand this will we be 
in a position to appreciate the reasons for the ways of the linguist which fre- 
quently seem strange to the outsider. 

As a first answer it might be proposed that linguistics is concerned with 
characterizing the class of acoustical signals which men make in speaking.
The natural way of going about this would be by investigating in detail the 
anatomical structures in man that make it possible for him to emit this special 
set of signals. One would investigate the human vocal tract: the larynx, the 
pharynx, the nasal cavity, the mouth, the tongue, the lips, etc., and one would
attempt to make statements about the motor capabilities of these organs. 
Once one had learned all there is to know about these physiological aspects 
of the problem, and, provided one knew a great deal of acoustics, one could 
give the desired description of the acoustical signals which such a mechanism
was capable of emitting. One might further investigate the analogous mechanisms
in other animals and might succeed in showing how the latter differ
from those of man and how this difference accounts for the differences in the 
respective acoustical outputs. The results of this inquiry would explain why 
the acoustical signals emitted by men in speaking differ from those of other
animals. 

This is a very important area of study, and linguistics is vitally interested 
in these questions. Yet these questions do not exhaust the problems of concern 
to the linguist: they are but a small part of the puzzles that the linguist would 
like to solve. As a matter of fact, if linguistics were limited to a consideration 
of these problems, there would hardly be any need for a separate discipline,
since all of the above problems are dealt with by physiology and acoustics. 

What makes linguistics as a field of enquiry quite different from physio- 
logical acoustics is the fact that what is commonly referred to as «linguistic 
behavior » covers a much broader area than the acoustical properties of speech, 
though—as I have already said—it specifically includes the latter. Let me 
now describe a few of these additional problems. 

We have all had the experience of hearing people speak with a foreign 
accent. Thus, for instance, we all know people who are physiologically normal, 
who yet find it difficult to distinguish sounds that we ourselves have no difficulty
whatever in distinguishing. For instance, no English speaker would
ever confuse the words «bitch» and « beach »—not even under conditions of 
high noise, as G. A. MILLER has shown. Yet a speaker of Russian or Italian 
would find it extremely difficult to keep them consistently apart. Clearly the 
difference in the behaviour of English and foreign speakers is not physio- 
logically determined, because the foreigner can—when his attention is drawn 
to it—make the required distinction. The difference in behavior is, of course, 
due to the fact that English, Russian, and Italian are different languages, and 
that different languages use different sounds. 

It may, therefore, be proposed that adult speakers have established a par- 
ticular behavior pattern of their vocal organs and that this behavior pattern 
accounts for the observed difficulty. Differences in language may, therefore,
be equated with different habitual movements of the tongue and lips and with
different co-ordinations of these movements. In other words, one might
conceivably explain linguistic differences on a physiological-acoustical basis,
provided one allowed for some learning. 

This, however, is not really an adequate explanation. Consider, for instance, 
the manner in which Latin is spoken by priests of different nationalities. An
English-speaking priest may read mass with a sound repertory that is 100%
English, and a French priest may read the same mass with a sound repertory
that is 100% French. Yet there is no sense to the statement that the language 
of the mass is anything but Latin. 

An attempt may still be made to save the view of language as a purely 
physiological-acoustical phenomenon by saying that, e.g., the English priest 
uses the sounds of English with the statistics appropriate for Latin. This, 
however, is hardly a good solution since it raises a host of extremely difficult 
problems. E.g., it raises the question of how it is possible to identify an
utterance as English on the basis of a very short sample, which might be totally
atypical. But even if this were possible, there are aspects of linguistic behavior 
which cannot be explained in terms of physiology and acoustics alone, regard- 
less of the refinements introduced. I shall now give a few examples of this. 

A joke quite popular among elementary school children in America is the 
following question and answer: Why can’t one starve in the desert? Because of
the sand which is there! The pun is based on the fact that word boundaries are
not always marked acoustically, and sand which is is frequently indistinguishable 
from sandwiches. Yet word boundaries are crucial in understanding the mes- 
sage correctly, and given enough context the speaker of English will know 
how to assign word boundaries even if they are not acoustically marked. 

Word boundaries, moreover, are not the only boundaries which have no 
acoustical signal and which affect the behavior of the speaker. Consider the 
following ambiguities: 


        Old | men and women            Old men | and women

        He rolled | over the carpet    He rolled over | the carpet

which are due to differences in phrase structure that are not marked acoustically.

I should like also to draw attention to another type of behavior. Every 
speaker of a language can perform rather elaborate transformations upon sen- 
tences. Thus, for instance, given a simple declarative sentence there is a 
standard way of converting it into a « yes or no» question; or given an active 
sentence there is a standard way for converting it into a passive. As an 
illustration of the latter take the sentence: A committee opposed the change in 
the bill, which can be readily transformed into The change in the bill was opposed
by a committee. In order to explain how to perform this operation we would
normally use such terms as noun phrase, verb phrase, transitive verb, etc., in
the obvious way. It is important to note, however, that here, too, there is 
no such thing as an acoustical signal for these categories, yet the categories 
are essential in order to explain the speaker’s behavior. 

Consider again the sentence, The change in the bill was opposed by a com- 
mittee. The choice of was as against were is governed by the number (singular 
or plural) of the head of the first noun phrase; i.e. change. But the head of 
the noun phrase, which itself is a noun phrase, does not have any acoustical 
marker to distinguish it from other noun phrases. 

It must also be noted that the head of the noun phrase governs the choice 
of was as against were quite independently of the number of intervening words; 
e.g., The change in the bill for the promotion of the study of the mating calls of
rhinoceri ... etc. ... was opposed by a committee.

Engineers and other non-linguists have usually neglected problems of the 
kind just surveyed, considering them either outside of their ken or relatively
unimportant refinements. Linguists, on the other hand, have been keenly
interested in such problems. The standard grammars of the different languages
always try to do something towards solving such problems. Unfortunately
the standard grammars fail to be consistent or to make clear the basis on which 
they operate. In what follows I shall try to present in outline a descriptive 
framework for language which I believe to be free of, at least, the most glaring
of these failings. The exposition will begin with a review of some recent work
of N. CHOMSKY and will go on to a discussion of the phonic aspects of language,
which were not considered by CHOMSKY.


1. — Chomsky’s analysis. 


According to CHOMSKY every language has three distinct sets of rules which 
operate on three different levels. On the highest level the rules are all of 
the type « X → Y », where « X → Y » stands for « replace X by Y », with the
restriction that not more than a single symbol can be replaced in a single rule
and that X ≠ Y.


As an illustration of these rules we can take the following (*): 


        Sentence → Noun Phrase + Verb Phrase + (Adverbial Phrase)      (1)
        Noun Phrase → (Article) + Noun + (Prepositional Phrase)        (2)
        Verb Phrase → Verb + (Noun Phrase)                             (3)
        Adverbial Phrase → Adverb                                      (4a)
        Adverbial Phrase → Prepositional Phrase                        (4b)
        Prepositional Phrase → Preposition + Noun Phrase               (5)
        Article → the                                                  (6a)
        Article → a                                                    (6b)
        Noun → committee                                               (7a)
        Noun → change                                                  (7b)
        Noun → dog                                                     (7c)
        Noun → walk                                                    (7d)
        Noun → result                                                  (7e)
        Noun → bill                                                    (7f)
        Verb → opposed                                                 (8a)
        Verb → took                                                    (8b)
        Verb → barked                                                  (8c)


(*) In applying a rule the symbols in parentheses may be omitted. The rules are 
only partially identical with those that would appear in an actual grammar of English. 


32 — Supplemento al Nuovo Cimento. 
        Preposition → of                                               (9a)
        Preposition → for                                              (9b)
        Preposition → in                                               (9c)


The application of these rules yields a partially-ordered set of symbol 
sequences. We shall call each symbol sequence a string, and the set of such
strings generated by the rules, a derivation. We may illustrate the process 
of applying the phrase structure rules by the following derivation: 


        Sentence                                                               by rule
        Noun Phrase + Verb Phrase                                              (1)
        Article + Noun + Verb Phrase                                           (2)
        Article + Noun + Verb + Noun Phrase                                    (3)
        Article + Noun + Verb + Article + Noun + Prepositional Phrase          (2)
        Article + Noun + Verb + Article + Noun + Preposition + Noun Phrase     (5)
        Article + Noun + Verb + Article + Noun + Preposition + Article + Noun  (2)
        a committee opposed the change in the bill                             (6b), (7a), (8a), (6a),
                                                                               (7b), (9c), (6a), (7f)
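
The derivation can be retraced mechanically. The following Python sketch is an illustration only: the dictionary layout, the label "2p" (rule (2) with its optional Prepositional Phrase retained) and the convention of always rewriting the leftmost matching symbol are assumptions made here, not part of the grammar as presented. The sequence of rule choices is supplied from the outside.

```python
# A subset of the phrase-structure rules (1)-(9c). Optional symbols are
# written out as separate alternatives; the label "2p" is hypothetical.
RULES = {
    "1":  ("Sentence", ["Noun Phrase", "Verb Phrase"]),
    "2":  ("Noun Phrase", ["Article", "Noun"]),
    "2p": ("Noun Phrase", ["Article", "Noun", "Prepositional Phrase"]),
    "3":  ("Verb Phrase", ["Verb", "Noun Phrase"]),
    "5":  ("Prepositional Phrase", ["Preposition", "Noun Phrase"]),
    "6a": ("Article", ["the"]),
    "6b": ("Article", ["a"]),
    "7a": ("Noun", ["committee"]),
    "7b": ("Noun", ["change"]),
    "7f": ("Noun", ["bill"]),
    "8a": ("Verb", ["opposed"]),
    "9c": ("Preposition", ["in"]),
}

# The choices made by the user of the grammar for the sample sentence.
CHOICES = ["1", "2", "3", "2p", "5", "2",
           "6b", "7a", "8a", "6a", "7b", "9c", "6a", "7f"]


def apply_rule(string, label):
    """Replace the leftmost occurrence of the rule's left-hand symbol."""
    lhs, rhs = RULES[label]
    i = string.index(lhs)  # raises ValueError if the symbol is absent
    return string[:i] + rhs + string[i + 1:]


def derive(choices, start="Sentence"):
    """Return every line of the derivation, beginning with the initial symbol."""
    lines = [[start]]
    for label in choices:
        lines.append(apply_rule(lines[-1], label))
    return lines


for line in derive(CHOICES):
    print(" + ".join(line))
```

Only one symbol is rewritten per step, so each line is at least as long as the one before it, and the last line contains terminal symbols only.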


Attention must be drawn to the following facets of the grammar just 
presented: 


1) The order of application of the rules is partly fixed owing to the fact
that a given rule can be applied only if the symbol to be replaced (i.e., the
one appearing on the left-hand side of the rule) appears in the derivation.
There must, therefore, be at least one initial symbol which must be supplied
to the grammar from the outside and which starts things off. For the present
set of rules the symbol « Sentence » will serve this function.


2) In order for the grammar to continue to operate it is necessary that
instructions be provided for selecting the next rule to be applied. The instructions
must be supplied from the outside. It is by exercising a choice, by
selecting one rule from a set of possible alternatives, that information is being
transmitted. This choice must evidently be made by the user of the grammar,
for only he can transmit information.


3) The grammar continues to operate as long as the string contains
symbols which themselves appear on the left-hand side of one or more rules.
The grammar stops operating when it has produced a string consisting of
symbols which occur only on the right-hand side of the rules (e.g., opposed
in rule (8a)) and hence are « irreplaceable ». We shall call these « irreplaceable »
symbols terminal symbols; strings consisting of terminal symbols only
shall be called terminal strings.

It is always possible to convert a derivation into a tree like the one below. 


        Sentence
        ├── Noun Phrase
        │   ├── Article: a
        │   └── Noun: committee
        └── Verb Phrase
            ├── Verb: opposed
            └── Noun Phrase
                ├── Article: the
                ├── Noun: change
                └── Prepositional Phrase
                    ├── Preposition: in
                    └── Noun Phrase
                        ├── Article: the
                        └── Noun: bill


The tree may be familiar to some readers from their school days. It
represents what is commonly known as « parsing » or « diagramming » or
« immediate constituent analysis » of the sentence. It contains at least a partial
answer to the question of whence come the boundaries which in spite of their 
possible lack of acoustical correlate are nevertheless important factors in the 
behavior of speakers. 

The restriction on the number of symbols that can be rewritten in a single 
rule guarantees that given a terminal string—i.e. a string produced by the 
application of the phrase-structure rules—it will be possible to discover the 
associated tree or trees. Since not more than one symbol can be rewritten in 
a single rule, every line in the derivation must have at least as many symbols 
as the one preceding it. Since repetitions of lines in the derivation are not 
admitted (X ≠ Y), there must be a finite number of lines between the first
line and the terminal string. One can, therefore, try out all one-line derivations, 
two-line derivations, three-line derivations, etc., until one comes upon a de- 
rivation having the desired terminal string. 

Since there may be more than one derivation yielding the same terminal 
string, there may be more than one tree associated with a single terminal
string. The fact that some terminal strings have more than one phrase structure
representation accounts for the ambiguity of phrases like old men and women;
he rolled over the carpet; etc.
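
The ambiguity of old men and women can be checked by enumerating every tree for the string. The sketch below is a hedged illustration: the three Noun Phrase rules and the symbols Adj and Conj are hypothetical additions (the rule list given earlier contains no adjectives or conjunctions), and the exhaustive search over ways of cutting the string is only one simple way of trying out all derivations.

```python
from itertools import combinations, product

# Hypothetical rules, introduced only to illustrate the ambiguity.
GRAMMAR = {
    "NP":   [("Adj", "NP"), ("NP", "Conj", "NP"), ("N",)],
    "Adj":  [("old",)],
    "Conj": [("and",)],
    "N":    [("men",), ("women",)],
}


def splits(words, n):
    """Yield every way of cutting `words` into n non-empty contiguous pieces."""
    for cuts in combinations(range(1, len(words)), n - 1):
        bounds = (0, *cuts, len(words))
        yield [words[a:b] for a, b in zip(bounds, bounds[1:])]


def trees(symbol, words):
    """Return all trees that derive the tuple `words` from `symbol`."""
    if symbol not in GRAMMAR:  # terminal symbol: must match a single word
        return [symbol] if words == (symbol,) else []
    found = []
    for rhs in GRAMMAR[symbol]:
        for pieces in splits(words, len(rhs)):
            options = [trees(s, p) for s, p in zip(rhs, pieces)]
            # one tree for every combination of subtrees
            found.extend((symbol, *kids) for kids in product(*options))
    return found


for tree in trees("NP", ("old", "men", "and", "women")):
    print(tree)
```

Under these assumed rules the search finds two trees, corresponding to the two bracketings Old | men and women and Old men | and women.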

By repeated reapplication of rules (2) and (5) endless sequences of words
may be generated. This is not an oversight but rather a reflection of the fact 
mentioned above that language places no upper bound on the length of sen- 
tences or of constituents, although all sentences are finite in length. 

We have made much of the fact that terminal strings have phrase structure.
It is now necessary to point out that terminal strings are abstract representations
of certain features of sentences and that actual sentences are, in fact,
not terminal strings. To see this, consider the English verb. Since verbs can 
be in the present tense as well as in the past we introduce a rule like the fol- 
lowing: 


        Verb Phrase → Verb + (Past) + (Noun Phrase) (*)                (3a)


We would then also need rules like 


        oppose + Past → opposed                                        (10a)
        write + Past → wrote                                           (10b)
        have + Past → had                                              (10c)
        think + Past → thought                                         (10d)
        be + Past → was                                                (10e)


Rule (10a) is within the restrictions imposed on phrase structure rules, 
for it requires in effect that the symbol « Past » be replaced by -d. The other 
four rules, however, violate the phrase structure constraints. E.g., in (10b)
the two symbols « write » and « Past » are replaced by « wrote » in one step, and
it is impossible to achieve the same result if only a single symbol is allowed
to be replaced in a single rule. Consequently, rules (10b) to (10e) are beyond
the power of the phrase structure level. Since all verbs violating the phrase 
structure constraints belong to the so-called « strong » or «irregular » verbs of 
English it may be proposed that these verbs be handled as exceptions; there 
would then be no need to utilize more powerful devices in the grammar. We 
shall see, however, that the phrase structure grammar is not powerful enough 
to handle other, perfectly regular verbal formations in a reasonably economical
fashion. The proposal to consider the « strong » verbs as exceptions is,
therefore, of little practical importance.


(*) We are disregarding the problems raised by number and person.

Consider now the Verb Phrases:


        had opposed     was opposing    had been opposing
        had written     was writing     had been writing
        had had         was having      had been having
        had thought     was thinking    had been thinking
        had been        was being       had been being


In order to generate the examples in the first column we should need the rule 


        Verb Phrase → have + (Past) + Verb + Perfect Participle + (Noun Phrase)   (3b)


as well as 

        oppose + Perfect Participle → opposed                          (11a)
        write + Perfect Participle → written                           (11b)
        have + Perfect Participle → had                                (11c)
        think + Perfect Participle → thought                           (11d)
        be + Perfect Participle → been                                 (11e)


In order to generate the examples of the second column we should need 
the following rules: 


        Verb Phrase → be + (Past) + Verb + Present Participle + (Noun Phrase)     (3c)

and

        Verb + Present Participle → Verb + -ing                        (12)

Finally, in order to generate the examples in the third column we need
the following additional rule:

        Verb Phrase → have + (Past) + be + Perfect Participle + Verb +
                      + Present Participle + (Noun Phrase)             (3d)


This rule, however, is the sum of rules (3a-c). It is, therefore, natural to in- 
vestigate whether the set of rules cannot be simplified. Examining rules (3a-d) 
we note the following regularities: 


a) The symbol « Past » is always associated with the first element of 
the Verb Phrase. 


b) If the Verb Phrase contains the auxiliary verb have, the symbol « Perfect
Participle » appears after the next element of the Verb Phrase.


c) If the Verb Phrase contains the auxiliary verb be, the symbol « Pre- 
sent Participle » appears after the next element of the Verb Phrase. 


d) If both auxiliary verbs have and be occur in the same Verb Phrase, 
have precedes be. 


e) The only element which must appear in the Verb Phrase is (the 
main) Verb. 


f) The auxiliary verbs precede (the main) Verb. 


The simplest way of handling these regularities is by positing the following 
two rules: 


        Verb Phrase → (Past) + (have + Perfect Participle) +
                      + (be + Present Participle) + Verb + (Noun Phrase)   (3')

and

        G + V → V + G                                                  (Z)


where V stands for any specific verb (lexical morpheme) like oppose, have,
be, think, etc., and G stands for a grammatical operator like « Perfect
Participle », « Past », etc.

Rule (Z) goes clearly beyond phrase structure, for it changes the order 
of the symbols, and once the order of the symbols in the strings is changed, 
there is no longer any possibility of associating a tree with a string. We are, 
therefore, faced with the alternative of either maintaining the phrase structure 
restriction and thereby greatly complicating our description (e.g., we would
be forced to have four separate rules in place of the single rule (3')), or of
admitting into the grammar new rules that are more powerful than those of 
the phrase structure level. There are various reasons why the latter alter- 
native is to be preferred. Accordingly we establish a second grammatical 
level, which, following CHOMSKY, we call the transformational level. 

It is not possible here to go into the details of the transformational level. 
These can be found in CHOMSKY’s book Syntactic Structures. I should like,
however, to draw attention to a few consequences of the decision to introduce
the transformational rules.

Since rule (Z) must precede rules like (10) and (11), the latter together 
with (Z) are part of the transformational level. This makes it unnecessary 
to do anything special about the « strong » verbs (rules (10b-d)), since on the
transformational level the prohibition against replacing more than one symbol
in a single rule does not hold.
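
The interplay of rule (3'), rule (Z) and the replacement rules (10) to (12) can be sketched in Python. This is an illustration under stated assumptions, not the formulation of the transformational level itself: the function names are hypothetical, rule (Z) is applied in a single left-to-right pass so that an operator which has already been moved is not moved again, and the spelling adjustment for -ing (dropping a final e) is an assumption about English orthography.

```python
VERBS = {"oppose", "write", "have", "think", "be"}
OPERATORS = {"Past", "Perfect Participle", "Present Participle"}

# Rules (10a-e) and (11a-e) as a lookup table.
SPELL = {
    ("oppose", "Past"): "opposed", ("write", "Past"): "wrote",
    ("have", "Past"): "had", ("think", "Past"): "thought",
    ("be", "Past"): "was",
    ("oppose", "Perfect Participle"): "opposed",
    ("write", "Perfect Participle"): "written",
    ("have", "Perfect Participle"): "had",
    ("think", "Perfect Participle"): "thought",
    ("be", "Perfect Participle"): "been",
}


def build(verb, past=False, perfect=False, progressive=False):
    """Expand rule (3'), leaving out the optional Noun Phrase."""
    symbols = ["Past"] if past else []
    if perfect:
        symbols += ["have", "Perfect Participle"]
    if progressive:
        symbols += ["be", "Present Participle"]
    return symbols + [verb]


def rule_z(symbols):
    """Rule (Z): G + V -> V + G, applied once from left to right."""
    pairs, i = [], 0
    while i < len(symbols):
        here = symbols[i]
        following = symbols[i + 1] if i + 1 < len(symbols) else None
        if here in OPERATORS and following in VERBS:
            pairs.append((following, here))  # the operator now follows its verb
            i += 2
        else:
            pairs.append((here, None))
            i += 1
    return pairs


def spell(verb, operator):
    """Replace a verb and the operator attached to it by an actual word."""
    if operator is None:
        return verb
    if (verb, operator) in SPELL:
        return SPELL[(verb, operator)]
    if operator == "Present Participle":  # rule (12): Verb + -ing
        stem = verb[:-1] if verb.endswith("e") and verb != "be" else verb
        return stem + "ing"
    raise KeyError((verb, operator))


def realize(symbols):
    return " ".join(spell(v, g) for v, g in rule_z(symbols))


print(realize(build("oppose", past=True, perfect=True, progressive=True)))
```

With all three optional elements chosen, the string Past + have + Perfect Participle + be + Present Participle + oppose is turned into had been opposing; the other entries of the table of Verb Phrases follow from the other choices.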

The terminal strings, the final output of the phrase-structure rules, will
contain symbols of two types: lexical morphemes like oppose, committee, of,
the, etc., and grammatical operators like « Past », « Perfect Participle », etc.
This is due to the fact that at least some grammatical operators cannot be 
replaced by phrase structure rules; e.g., « Past » is replaced in rules (10a-e), 
which are, however, transformational and not phrase structure rules. 

The terminal string corresponding to our sample sentence is therefore 
represented, with some simplifications and omissions, as follows: 


a + committee + Past + oppose + the + change + in + the + bill 


The transformational rules operate on terminal strings and the trees asso- 
ciated with them. The notion «head of noun phrase » which we have had 
occasion to use in the above discussion has an obvious and simple meaning 
if reference is made to the tree associated with the particular noun phrase. It 
is a matter of considerable difficulty to give a clear meaning to this notion if 
one limits oneself only to the terminal string. 

Up to this point we have been concerned exclusively with what might be 
termed abstract properties of language and we have said nothing of its acoustical 
features. It is now necessary to examine the relationship between the ab- 
stract entities that have been described in the preceding pages and the con- 
crete sound waves that comprise the spoken message. 


2. — Sounds of speech. 


The problem with which we shall be concerned in this lecture is the manner 
in which the sounds of speech are to be described. In every science the choice 
of a descriptive framework is an extremely important matter. It is usually 
not enough that the description reflect the physical facts to a sufficient degree 
of precision. We would like to describe these facts in such a way as to open 
up the possibility of saying other things of interest, too. The following example 
illustrates this point as it may affect the linguist. 

English speakers form the regular plural of nouns by adding a sound or 
sounds to the singular stem. They add [iz] if the noun ends in [s], [z],
[š], [ž], [č], [ǰ] (e.g., busses, causes, bushes, garages, beaches, badges); they
add [s] if the noun ends in [p], [f], [t], [θ], [k] (e.g., caps, cuffs, cats, fourths,
backs); and they add [z] in all other cases. 

In stating this we have, however, made a number of decisions regarding the 
manner in which we shall describe the facts. We have spoken of individual 
sounds—let us henceforth call them segments—and we have attached labels 
to them; e.g., [s], [z]. We have decided in effect to view utterances as sequences
of a number of discrete entities. If we were asked why we made this decision
we would surely reply that this seems to us to lead to a simple description of
all kinds of facts. The questioner, being a linguist in disguise, might then
point out that our description would be even simpler if we had a label for the
segments [s], [z], [š], [ž], [č], [ǰ], and another one for the segments [p], [f],
[t], [θ], [k]. But this is indeed the case if we describe the segments with the
help of any of the standard phonetic frameworks: the first set consists of 
the noisiest sounds in the English language, variously called hushing and 
hissing or strident sounds, and the second set contains only voiceless sounds. 
In other words, the classification of sounds into strident and not strident 
(mellow), and voiced and voiceless fits well with the above facts. 

We can now simplify the previous formulation in the following rules: 


R.1 If the noun ends in a strident consonant, then Plural → [iz].


R.2 If a noun ends in a consonant which is voiceless, but is not strident,
Plural → [s].


R.3 In all other cases, Plural → [z].
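
Rules R.1 to R.3 can be written as a short function over the final segment of the noun. The Python sketch below is illustrative only: the two sets list just the segments named in the text (with [f] grouped among the non-strident voiceless segments, as the rules above require), and the function name is hypothetical.

```python
# Segments named in the text, grouped by the two attributes the rules use.
STRIDENT = {"s", "z", "š", "ž", "č", "ǰ"}
VOICELESS = {"p", "f", "t", "θ", "k", "s", "š", "č"}


def plural_ending(final_segment):
    """Return the plural ending selected by the last segment of the noun."""
    if final_segment in STRIDENT:    # R.1
        return "iz"
    if final_segment in VOICELESS:   # R.2: voiceless, and not strident
        return "s"
    return "z"                       # R.3: all other cases


# Final segments of some of the nouns cited above, plus dog.
for noun, last in [("bus", "s"), ("beach", "č"), ("cap", "p"),
                   ("fourth", "θ"), ("dog", "g")]:
    print(noun, "->", plural_ending(last))
```

The order of the tests matters: a strident voiceless segment such as [s] is caught by R.1 before R.2 is considered, which is why R.2 need only check voicing.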


In order to obtain simple rules we have described the utterances of English 
in a very special way. In particular we have regarded the utterances as con- 
sisting of sequences of discrete segments, and we have viewed the segments 
as simultaneous actualization of sets of attributes like voicing, stridency, con- 
sonantality, etc. 

It is a well-known fact that viewed as an acoustic phenomenon speech is
quasi-continuous; in many instances there is no obvious procedure for
segmenting the continuous acoustic signal in a way which would correspond with
the segmentation imposed by linguistic considerations. The question may,
therefore, arise: in what sense can utterances be said to consist of discrete
entities in sequence?

While a rigorous segmentation procedure which would show in all cases 
a one-to-one correspondence with the linguistic representation, may not be 
possible, it is possible to construct devices which produce speech by utilizing 
a set of discrete instructions which coincide closely with the linguistic seg- 
mentation. The devices I have in mind are of the type of the Bell Telephone 
Laboratories’ Voder or the Haskins Laboratories’ Octopus. The signal emitted 
by these devices is continuous speech, yet the input instructions are discrete. 
There is, therefore, a good sense in which utterances can be said to be made
up of discrete segments.

In addition to viewing utterances as consisting of discrete segments we 
have also viewed the segments as simultaneous actualizations of a set of
attributes. In the descriptive framework with which we will be concerned below,
the number of such attributes is quite small, about 15. These 15 attributes 
are sufficient to characterize all segments in all languages. Since we cannot 
have knowledge of all languages (e.g., of languages which will be spoken in
the future), the preceding assertion must be understood as a statement about
the nature of human language in general. It asserts in effect that human
languages are phonetically much alike, that they do not « differ from one another
without limit and in unpredictable ways. » Like all generalizations this statement 
can be falsified by valid counter-examples. It can, however, not be proven true 
with the same conclusiveness. The best that can be done is to show that the 
available evidence makes it very likely that the statement is true. Most
important in this connection is the fact that all investigations in which large
numbers of languages have been examined, from E. SIEVERS’ Grundzüge der
Phonetik (1876) to TRUBETZKOY’s Grundzüge der Phonologie (1939) and PIKE’s
Phonetics (1943), have operated with an extremely restricted set of attributes.
If this can be done with about a hundred languages from all parts of the 
globe, there appears good reason to believe that a not greatly enlarged 
catalogue of attributes will be capable of handling the remaining languages 
as well. 

The phonetic attributes and the segments are devices in terms of which the 
linguist represents his data. Like descriptive parameters in other sciences, 
these do not always stand in a simple one-to-one relationship with the obser- 
vable facts. We have already had to remark on this indirect relation in the 
discussion of the segmentation of the utterance. A similar situation prevails 
with regard to the phonetic attributes. The absence of this simple relationship, 
however, does not mean that there is no specific connection between the de- 
scriptive devices and the data of linguistics. In the third lecture I shall attempt 
to outline this relationship.

If it is true that a small set of attributes suffices to describe the phonetic 
properties of all languages of the world, then it would appear quite likely that 
these attributes are connected with something fairly basic in man’s consti- 
tution, something which is quite independent of his cultural background. 
Psychologists might find it rewarding to investigate the phonetic attributes; 
for it is not inconceivable that these attributes will prove to be very productive 
parameters for describing man’s responses to auditory stimuli in general. It 
must, however, be noted that for purposes of linguistics, the lack of psycho- 
logical work in this area is not fatal. For the linguist it suffices if the attributes 
selected yield reasonable, elegant and insightful descriptions of all relevant 
linguistic data. 

The attributes in terms of which we shall describe the sounds of speech 
are due primarily to R. JAKOBSON. Following JAKOBSON, we shall call these 
attributes distinctive features. The distinctive features have been described in
detail elsewhere. We shall, therefore, present here only the articulatory
correlates of a few distinctive features (*).


506 M. HALLE


Articulatory correlates of the distinctive features (partial list). (**) 


1. Vocalic - nonvocalic. Single vocal cord source and absence of total
   occlusion in the oral cavity.

2. Consonantal - nonconsonantal. Presence of major constriction in the central
   path through the oral cavity.

3. Diffuse - nondiffuse. Oral cavity more constricted in front than at velum
   (backward flanged).

4. Compact - noncompact. Oral cavity more constricted at velum than in
   front (forward flanged, horn shaped).

5. Grave - acute. Major constriction in periphery (lips or velum) of oral cavity.

6. Nasal - nonnasal. Velum lowered.

7. Voiced - unvoiced. Vocal cords vibrating.

8. Flat - natural. Lips rounded.

9. Continuant - interrupted. No stoppage of air flow through mouth.


The first two features produce a quadri-partite division of the sounds of 
speech into 1) Vowels, which are vocalic and nonconsonantal; 2) Liquids, 
[r], [l], which are vocalic and consonantal; 3) Consonants, which are
nonvocalic and consonantal; and 4) Glides, [h], [w], [j], which are nonvocalic
and nonconsonantal.
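
The four-way division is simply the set of combinations of two binary attributes. A minimal Python sketch, in which the function name and the feature table are hypothetical and the entries for [a] and [p] are assumed examples of a vowel and a consonant not taken from the text:

```python
def major_class(vocalic, consonantal):
    """Name the class selected by the first two distinctive features."""
    if vocalic and not consonantal:
        return "vowel"
    if vocalic and consonantal:
        return "liquid"
    if consonantal:
        return "consonant"
    return "glide"


# (vocalic, consonantal) for a few segments.
SEGMENTS = {
    "a": (True, False),
    "r": (True, True), "l": (True, True),
    "p": (False, True),
    "h": (False, False), "w": (False, False), "j": (False, False),
}

for segment, features in SEGMENTS.items():
    print(segment, major_class(*features))
```

Each further binary feature at most doubles the number of classes that can be distinguished, which is how a small catalogue of attributes can characterize a large inventory of segments.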

Like all phonetic frameworks, the distinctive feature system is a catalogue 
of attributes. The distinctive feature system differs from other phonetic frame- 
works in that it contains only binary attributes. A segment, e.g., is either 
voiced or voiceless, and there are no intermediate degrees of voicing of which
cognizance needs to be taken. 

The question may well arise whether this is more than an empty trick, 
since any number of distinctions can always be expressed in terms of binary


(*) The fact that in the following list, reference is made only to the articulatory 
properties of speech and nothing is said about the acoustical properties, is not to be 
taken as an indication that the latter are somehow less important. The only reason 
for concentrating here exclusively on the former is that these are more readily observed 
without instruments. If reference were to be made to the acoustical properties of 
speech it would be necessary to report on experimental findings of fair complexity 
which would expand the present lecture beyond its allowed limits.

(**) Each feature is designated by a pair of antonymous adjectives, which, in
accordance with the following convention, are used also to designate the segments. If the
given description applies to a segment, it is designated by the first adjective; if the
description does not apply, the segment is designated by the second adjective.

properties. All phonetic frameworks incorporate a large number of binary
attributes: e.g., voicing, nasality, rounding, aspiration, palatalization, etc. It is,
of course, possible to replace these attributes by multi-valued properties. No
one has ever shown, however, that anything is to be gained by this
substitution. The replacement of multi-valued properties by binary features, on
the contrary, does result in a gain.

In order to see this we shall examine the so-called point of articulation. 
The «point of articulation » is the place of maximum constriction in the oral 
cavity, and it has been customary to describe consonants in terms of this 
point. Thus, for instance, [p] is usually said to have a bilabial point of arti- 
culation, [f], a labio-dental point of articulation, [t], a dental or post-dental 
point of articulation, [k], a velar point of articulation, etc. No limitation is
placed on the number of such points. In any given language, however, the 
number of separate points that need to be recognized is rather small. As a 
matter of fact, it can be shown that four such points suffice to describe all 
relevant facts in any known language. Instead of the multi-valued point of 
articulation dimension, the distinctive feature system contains the two features 
compact-noncompact and grave-acute, which distinguish the required four 
classes of segments: [p] is noncompact grave, [t] is noncompact acute, [c] as 
in keys is compact acute and [k] as in cool is compact grave. 
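
The four-way classification just described can be sketched as a lookup on two binary values. This is only an illustration: the names `PLACE_CLASSES` and `classify_place` are assumptions introduced here, and only the four feature assignments themselves come from the text.

```python
# Illustrative sketch: two binary features take the place of the
# multi-valued "point of articulation". Keys are (compact, grave).
PLACE_CLASSES = {
    (False, True): "p",   # noncompact grave
    (False, False): "t",  # noncompact acute
    (True, False): "c",   # compact acute (as in "keys")
    (True, True): "k",    # compact grave (as in "cool")
}


def classify_place(compact: bool, grave: bool) -> str:
    """Return the example segment for a (compact, grave) combination."""
    return PLACE_CLASSES[(compact, grave)]


if __name__ == "__main__":
    for (compact, grave), segment in PLACE_CLASSES.items():
        print(f"[{segment}]: compact={compact}, grave={grave}")
```

Two binary features yield exactly four classes, which is the number of points the text says must be recognized.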

The distinctive feature system employs less descriptive machinery than do 
other phonetic systems. Whereas in other systems the number of possible 
points of articulation is not restricted, in the distinctive feature system there 
are only as many different classes as are absolutely necessary. The decision 
to replace the point of articulation by two binary features, however, has other 
interesting consequences as well; e.g., it makes it possible to explain in a 
simple manner certain linguistic changes which have puzzled linguists for a 
long time. One such example we shall examine in some detail. 

It has been observed that when sounds change, these changes are gradual. 
E.g., it is quite common for a voiced consonant to change into its voiceless 
cognate or vice versa ([v] → [f] or [k] → [g]); it is uncommon, or perhaps
even unknown, for a voiceless consonant to change into a vowel ([k] → [u];
[f] → [a]). This observation can be conveniently expressed in terms of
distinctive features as follows: a sound change rarely affects more than one feature.

In certain languages it has been found that [k] changes into [p] or vice 
versa. In terms of the multi-valued point of articulation this change is rather 
surprising, for [p] and [k] are produced with constrictions at opposite ends of 
the oral cavity. One might expect a change of [p] to [t] since they have adjacent
points of articulation, but it seems rather curious that [p] and [k], which are
articulated at such widely separated points, should be confused. The distinctive
feature system, however, provides a simple explanation for the puzzle. In
terms of the distinctive features [p] and [k] differ in only a single feature: [p]
is noncompact and [k] is compact. Consequently, the change of [p] into
[k] is structurally quite similar to the change of a voiced consonant into its
voiceless cognate.
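
The observation that a sound change rarely affects more than one feature can be illustrated by counting the features in which two segments differ. The sketch below is a hedged illustration: the table `FEATURES` lists only three features per segment, and both it and the function name `feature_distance` are assumptions made for the example.

```python
# Hypothetical, partial feature table: only compactness, gravity and
# voicing are listed, following the examples discussed in the text.
FEATURES = {
    "p": {"compact": False, "grave": True, "voiced": False},
    "t": {"compact": False, "grave": False, "voiced": False},
    "k": {"compact": True, "grave": True, "voiced": False},
    "g": {"compact": True, "grave": True, "voiced": True},
}


def feature_distance(a: str, b: str) -> int:
    """Count the features whose values differ between segments a and b."""
    fa, fb = FEATURES[a], FEATURES[b]
    return sum(1 for name in fa if fa[name] != fb[name])


if __name__ == "__main__":
    for pair in [("p", "k"), ("k", "g"), ("p", "t"), ("t", "k")]:
        print(pair, feature_distance(*pair))
```

Under these assumed values [p] and [k] are one feature apart, exactly like [k] and [g], whereas [t] and [k] are two features apart.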

The second difference between most standard systems and the distinctive 
feature system lies in the treatment of the two major classes of segments, the 
vowels and the consonants. In most standard systems these two classes are 
described in terms of features which are totally different: consonants are de- 
scribed in terms of the « points of articulation, » whereas vowels are described 
in terms of the so-called « vowel triangle. » In the distinctive feature system, 
on the other hand, these two classes are handled by the same features:
compact-noncompact, (diffuse-nondiffuse) and grave-acute. The distinctive feature
system is thus more economical than other phonetic systems (*). 


3. — Phonology. 


Utterances are represented as sequences of distinctive feature segments.
Although in many instances the latter stand in a one-to-one relationship with
the sounds that we speak and hear, there are many instances where this
relation is anything but simple. It is the major aim of the present lecture to
elucidate this connection. The part of linguistics that is concerned with this
problem is called phonology.

The phrase structure grammar, which was presented in Sect. 1,
contained rules like « Noun → committee, bill, etc. » (cf. rules (6)-(9)). These
rules are basically lists of all existing morphemes in the language. Our
purpose in preparing a scientific description of a language is, however, not achieved
if we give only an inventory of all existing morphemes; we must also describe
the structural principles which underlie all existing forms. Just as syntax is
not identical with an inventory of all observed sentences of a language, so
phonology, i.e., a description of its phonic aspects, is not identical with a
list of existing morphemes.

In order to generate a specific sentence it is necessary to supply to the
grammar instructions for selecting from the lists of morphemes (i.e., from the
morphemes appearing on the right hand side of rules (6)-(9)) the particular
morphemes appearing in the sentence. Instead of using an arbitrary numerical
code which tells us nothing about the phonetic structure of the morphemes,
it is possible, and also more consonant with the aims of a linguistic description,
to utilize for this purpose the distinctive feature representation of the
morphemes directly. In other words, instead of instructing the grammar to select
noun (7f), we instruct the grammar to select the noun which in its first
segment has the features: nonvocalic, consonantal, noncompact, grave, voiced, etc.;
in its second segment, the features: vocalic, nonconsonantal, diffuse, acute, etc.;
in its third segment, the features: vocalic, consonantal, etc. Instructions of
this type need not contain information about all features but only about
features or feature combinations which serve to distinguish one morpheme from
another. This is a very important fact since in every language only certain
features or feature combinations can serve to distinguish morphemes from one
another. We call these features and feature combinations phonemic, and we
can say that in the input instructions only phonemic features or feature
combinations must occur.

(*) It is curious to note that the Hindu phoneticians had the idea of treating vowels
and consonants together over 2000 years ago. Their solution differs from the one
proposed here in that it classified vowels as well as consonants in terms of their points
of articulation.

Languages differ also in the way they handle nonphonemic features or
feature combinations. For some of the nonphonemic features there are
definite rules; for others the decision is left up to the speaker who can do as he
likes. E.g., the feature of aspiration is nonphonemic in English; its
occurrence is subject to the following conditions:


a) All segments other than the voiceless stops [k], [p], [t] are unaspirated.

b) The voiceless stops are never aspirated after [s].

c) Except after [s], voiceless stops are always aspirated before an
accented vowel.

d) In all other positions, aspiration of voiceless stops is optional.
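
Conditions a)-d) can be read as an ordered decision procedure. The following is a minimal sketch under stated assumptions: segments are plain strings, the caller says whether an accented vowel follows, and the function name `aspiration` and its parameters are hypothetical.

```python
VOICELESS_STOPS = {"k", "p", "t"}


def aspiration(segment, previous=None, before_accented_vowel=False):
    """Return 'never', 'always' or 'optional' following conditions a)-d)."""
    if segment not in VOICELESS_STOPS:
        return "never"      # a) only voiceless stops can be aspirated
    if previous == "s":
        return "never"      # b) never aspirated after [s]
    if before_accented_vowel:
        return "always"     # c) obligatory before an accented vowel
    return "optional"       # d) left to the speaker elsewhere


if __name__ == "__main__":
    print(aspiration("p", before_accented_vowel=True))                 # pin
    print(aspiration("p", previous="s", before_accented_vowel=True))   # spin
    print(aspiration("t"))
```

The order of the tests matters: condition b) must be checked before c), since c) applies only "except after [s]".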


A complete grammar must obviously contain a statement of such facts, 
for they are of crucial importance to one who would speak the language cor- 
rectly. 

In addition to features like aspiration in English, which are never
phonemic, there are features in every language which are phonemic only in those
segments where they occur in conjunction with certain other features, and are
not phonemic in other segments. E.g., in English the feature of voicing is
phonemic only in the nonnasal consonants; all other segments except [h] are
normally voiced, while [h] is voiceless.

So far we have dealt only with features which are nonphonemic regardless 
of neighboring segments. There are also cases where features are nonphonemic 
because they occur in the vicinity of certain other segments. 

As an example we might take the segment sequences at the beginning of
English words. It will be recalled that the features vocalic-nonvocalic and
consonantal-nonconsonantal distinguish four classes of segments: Vowels,
symbolized here by V, are vocalic and nonconsonantal; Consonants, symbolized
by C, are nonvocalic and consonantal; Liquids [r], [l], symbolized by L,
are vocalic and consonantal; the Glide [h], symbolized by H, is nonvocalic
and nonconsonantal (*). We shall be concerned solely with restrictions on these
four classes; all further restrictions within the classes are disregarded here.

English morphemes can begin only with V, CV, LV, HV, CCV, CLV,
and CCLV: e.g., odd, do, rue, who, stew, clew, screw. A number of sequences
are not admitted initially; e.g., LCV, HLV. These constraints are reflected
in the following three rules which are part of the grammar of English:


Rule MS1: If a morpheme begins with a consonant followed by a nonvocalic
segment, the latter is also consonantal.


Rule MS2: If a morpheme begins with a sequence of two consonants, the
third segment in the sequence is vocalic.


Rule MS3: If between the beginning of a morpheme and a liquid or a glide 
no vowel intervenes, the segment following the liquid or the 
glide is a vowel. 


These rules enable us to specify uniquely a number of features in certain
segment sequences; e.g. (unspecified entries left blank),

    vocalic        −   −
    consonantal    +       +

is converted by rules MS1, 2 and 3 into

    vocalic        −   −   +   +
    consonantal    +   +   +   −

which stands for a sequence CCLV; e.g., straw.
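
The way the three rules fill in unspecified features can be sketched in code. This is an illustrative reading of MS1-MS3, not the author's formalism: each segment is a dictionary over the two features with values "+", "-" or "0" (unspecified), the rules are applied in the order MS1, MS2, MS3, and the names `apply_ms_rules`, `segment_class` and `STRAW` are assumptions.

```python
def segment_class(seg):
    """Map a fully specified segment to V, C, L or H ('?' if unspecified)."""
    table = {("+", "-"): "V", ("-", "+"): "C", ("+", "+"): "L", ("-", "-"): "H"}
    return table.get((seg["vocalic"], seg["consonantal"]), "?")


def apply_ms_rules(segments):
    """Apply MS1, MS2 and MS3 in that order to a morpheme-initial sequence."""
    segs = [dict(s) for s in segments]  # work on a copy
    # MS1: consonant + nonvocalic segment -> the latter is consonantal.
    if len(segs) >= 2 and segment_class(segs[0]) == "C" and segs[1]["vocalic"] == "-":
        segs[1]["consonantal"] = "+"
    # MS2: two initial consonants -> the third segment is vocalic.
    if len(segs) >= 3 and segment_class(segs[0]) == "C" and segment_class(segs[1]) == "C":
        segs[2]["vocalic"] = "+"
    # MS3: a liquid or glide not preceded by a vowel is followed by a vowel.
    for i, seg in enumerate(segs[:-1]):
        kind = segment_class(seg)
        if kind == "V":
            break
        if kind in ("L", "H"):
            segs[i + 1]["vocalic"] = "+"
            segs[i + 1]["consonantal"] = "-"
            break
    return segs


# Input instructions for "straw": four segments, partly unspecified.
STRAW = [
    {"vocalic": "-", "consonantal": "+"},
    {"vocalic": "-", "consonantal": "0"},
    {"vocalic": "0", "consonantal": "+"},
    {"vocalic": "0", "consonantal": "0"},
]

if __name__ == "__main__":
    print("".join(segment_class(s) for s in apply_ms_rules(STRAW)))
```

MS2 can apply only after MS1 has made the second segment a consonant, and MS3 only after MS2 has made the third segment a liquid, which illustrates why the rules are ordered.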


The MS rules are partially ordered. If the order is not imposed they
will have to be given in a much more complex form. Let us now introduce
the convention that whenever a feature is not specified in a segment, a zero
shall be written in the appropriate column and row. We shall say, therefore,
that a zero stands for an unspecified feature, and a plus or a minus, for a
specified feature. In terms of this convention the sequence of columns
representing the different morphemes (i.e., the input instructions for phrase
structure rules (6)-(9)) will contain many zeros; indeed as many zeros as are
compatible with attaining the aims of the grammar.


We define an order-relation between segment-types: We shall say that


(*) We consider the semivowels [j] as in you and [w] as in woo to be positional
variants of the vowels [i] and [u], respectively. 


4 QUESTIONS OF LINGUISTICS 511 


segment-type A is « contained » in segment-type B if and only if the following
two conditions are satisfied: 1) all specified features of A are found with the
identical values (the same pluses and minuses) in B; and 2) at least one feature
specified in B is unspecified (has a zero) in A. The set of all elements not
« contained » in any other element is called the set of maximal segment-types.


Examples:

(1) Segment-types A, B, C described by features F1 and F2 [matrix entries
    illegible in the source]. A is « contained » in C. The set of maximal
    segment-types is {B, C}.

(2)
               A    B    C
        F1     +    −    0
        F2     0    +    −

    All segment-types are maximal.
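
The definition of « contained » and of the set of maximal segment-types translates directly into code. The sketch below is an illustration only: the two matrices `example_1` and `example_2` are assumed values chosen to reproduce the two outcomes stated in the examples, not a transcription of the printed matrices, and the function names are hypothetical.

```python
def contained(a, b):
    """True if segment-type a is 'contained' in segment-type b."""
    # 1) every specified feature of a has the identical value in b
    same_values = all(b[f] == v for f, v in a.items() if v != "0")
    # 2) at least one feature specified in b is unspecified in a
    b_more_specific = any(a[f] == "0" and b[f] != "0" for f in a)
    return same_values and b_more_specific


def maximal(types):
    """Names of the segment-types not contained in any other one."""
    return {
        name
        for name, seg in types.items()
        if not any(contained(seg, other) for o, other in types.items() if o != name)
    }


# Assumed matrix in which A is contained in C.
example_1 = {
    "A": {"F1": "+", "F2": "0"},
    "B": {"F1": "-", "F2": "+"},
    "C": {"F1": "+", "F2": "-"},
}

# Assumed matrix in which every segment-type is maximal.
example_2 = {
    "A": {"F1": "+", "F2": "0"},
    "B": {"F1": "-", "F2": "+"},
    "C": {"F1": "0", "F2": "-"},
}

if __name__ == "__main__":
    print(sorted(maximal(example_1)))
    print(sorted(maximal(example_2)))
```

Two segment-types are both maximal exactly when some feature carries a plus in one and a minus in the other, or when each leaves unspecified something the other specifies.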


It has often been observed in linguistics that the primary function of the 
phonemes of a language is to distinguish one morpheme from another. It is,
therefore, natural to require that the set of phonemes of a language be a set
of maximal segment-types. In other words, given any two phonemes of a 
language, it must be the case that for at least one feature, one phoneme has 
a plus where the other phoneme has a minus, or vice versa. 
Each specified feature in a segment represents a piece of information that
must be provided in the input instructions. If our grammar is a realistic picture
of the language, then this information must be supplied by the speaker. Since 
we speak quite rapidly—at a rate which may be as high as 20 segments per 
second—it is only reasonable to assume that the number of specified features 
in the input instructions is consistently kept at a minimum. One way of ap- 
proaching this desideratum is by minimizing the number of specified features 
per phoneme. It can be shown that if this condition is imposed on a set of 
maximal segment-types, it will be possible to map into a branching diagram 
the matrix representing the set of segment-types, in such a way that if to 
each node a particular feature is assigned, then each path through the diagram 
beginning at the initial node and ending at the end points of the branching 
diagram represents a phoneme. 
In order to see what is involved, consider the following sets of maximal
segment-types.

(1) Segment-types A, B, C described by features F1, F2, F3 [matrix entries
    illegible in the source]. This set of maximal segment-types is not
    mappable into a branching diagram.

(2) Segment-types A, B, C, D, E, F described by features F1, F2, F3 [matrix
    entries illegible in the source], with the branching diagram

                    F1
                /        \
             ABC          DEF
              F2           F3
            /    \       /    \
           A      BC    D      EF
                  F3           F2
                 /  \         /  \
                B    C       E    F

    Note that in the left branch of this branching diagram, F2 precedes F3,
    while in the right branch the inverse order obtains. Without this reversal
    in the order of the features, the above set of maximal segment-types is
    not mappable into a branching diagram.

(3) Segment-types A, B, C, D, E, F described by features F1, F2, F3 [matrix
    entries illegible in the source], with the branching diagram

                    F1
                /        \
             ABC          DEF
              F2           F2
            /    \       /    \
           A      BC    D      EF
                  F3           F3
                 /  \         /  \
                B    C       E    F

    This set of maximal segment-types can be mapped into a branching diagram
    with a unique ordering of the features.


The possibility of mapping a distinctive feature matrix into a branching
diagram hinges upon the existence in the matrix of at least one feature for
which there are no zeros. This feature, which must be assigned to the first
node, subdivides the segment-types into two classes. The next two nodes 
must be assigned to features which have no zeros for any of the segments in 
the two sub-classes. These may be the same or different features. The same 
procedure must again be possible with regard to the segments in each of the 
four sub-classes established by the former features; etc. When a sub-class 
contains a single segment-type, the segment-type is fully specified, and the 
path through the branching diagram represents exactly its distinctive feature 
composition. The two conditions establish a hierarchy among the features. 
This hierarchy, however, need not be complete. For instance, when there 
are in the matrix two features which contain no zeros, there is no reason 
to put one feature before the other; any order will be satisfactory. Partial 
ordering of features for different reasons is illustrated in the second example 
above. 
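
The procedure described in this paragraph can be sketched as a recursive search. What follows is a hedged illustration rather than the author's algorithm: at each node it looks for a feature that has no zeros within the current sub-class and that splits it, and it returns `None` when no such feature exists. The matrix `MAPPABLE` is an assumed one, chosen so that F2 precedes F3 on one branch and follows it on the other; `UNMAPPABLE` is likewise assumed, and the name `build_tree` is hypothetical.

```python
def build_tree(types, used=()):
    """Return a nested (feature, plus_subtree, minus_subtree) tuple,
    a segment-type name at a leaf, or None if no mapping exists."""
    if len(types) == 1:
        name, seg = next(iter(types.items()))
        # A leaf is valid only if every specified feature lies on its path.
        ok = all(v == "0" or f in used for f, v in seg.items())
        return name if ok else None
    features = list(next(iter(types.values())))
    for f in features:
        if f in used:
            continue
        values = {seg[f] for seg in types.values()}
        if values != {"+", "-"}:
            continue  # the feature has a zero here, or does not split
        plus = {n: s for n, s in types.items() if s[f] == "+"}
        minus = {n: s for n, s in types.items() if s[f] == "-"}
        left = build_tree(plus, used + (f,))
        right = build_tree(minus, used + (f,))
        if left is not None and right is not None:
            return (f, left, right)
    return None


# Assumed matrix: F2 before F3 under F1 = +, the reverse under F1 = -.
MAPPABLE = {
    "A": {"F1": "+", "F2": "-", "F3": "0"},
    "B": {"F1": "+", "F2": "+", "F3": "-"},
    "C": {"F1": "+", "F2": "+", "F3": "+"},
    "D": {"F1": "-", "F2": "0", "F3": "-"},
    "E": {"F1": "-", "F2": "-", "F3": "+"},
    "F": {"F1": "-", "F2": "+", "F3": "+"},
}

# Assumed matrix in which every feature has a zero somewhere.
UNMAPPABLE = {
    "A": {"F1": "+", "F2": "0", "F3": "-"},
    "B": {"F1": "-", "F2": "+", "F3": "0"},
    "C": {"F1": "0", "F2": "-", "F3": "+"},
}

if __name__ == "__main__":
    print(build_tree(MAPPABLE))
    print(build_tree(UNMAPPABLE))
```

Because the search is repeated independently inside each sub-class, different branches may use the features in different orders, which is the partial ordering the text refers to.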

The hierarchy of features established by the two formal conditions imposed 
on phonemes provides an explanation for a number of observations made by 
linguists. It accounts, e.g., for the intuition that the distinction between vowels 
and consonants is somehow more crucial to the phonological system than the 
distinction between accented and unaccented vowels, or between stops and 
continuants. Since in all phonological systems it happens to be the case that 
the features vocalic-nonvocalic and consonantal-nonconsonantal must precede 
all other features, it is quite natural that the segment classes established by 
these two features should be felt to be more central than other classifications 
of segments. 

An interesting result of a different sort is obtained in the case of the Finnish 
vowel system. Finnish has eight vowel phonemes, which can be
characterized by means of the following distinctive feature matrix.


               [æ]  [a]  [e]  [ö]  [o]  [i]  [ü]  [u]
    flat        −    −    −    +    +    −    +    +
    compact     +    +    −    −    −    −    −    −
    diffuse     −    −    −    −    −    +    +    +
    grave       −    +    −    −    +    −    −    +


Since, however, it is necessary to minimize the number of specified features 
per segment, we replace certain specified features by zeros as follows: 


33 - Supplemento al Nuovo Cimento. 


               [æ]  [a]  [e]  [ö]  [o]  [i]  [ü]  [u]
    flat        −    −    −    +    +    −    +    +
    compact     +    +    −    0    0    −    0    0
    diffuse     0    0    −    −    −    +    +    +
    grave       −    +    0    −    +    0    −    +

                                 Flat
                    −         /        \         +
             [æ a e i]                    [ö o ü u]
              Compact                      Diffuse
           −  /     \  +                −  /     \  +
        [e i]       [æ a]            [ö o]       [ü u]
       Diffuse      Grave            Grave       Grave
       − /  \ +    − /  \ +         − /  \ +    − /  \ +
      [e]   [i]   [æ]   [a]        [ö]   [o]   [ü]   [u]


This replacement of specified features by zero has, however, an interesting
parallel. Finnish is one of the languages which possess vowel harmony; i.e.,
there is a restriction on what vowels can occur in a single word. In the case
of Finnish, a word can contain a selection either from the set [æ, ö, ü, e, i] or
from the set [a, o, u, e, i]. The minimal distinctive feature matrix provides us
with a very elegant formula for the description of these facts; i.e., a Finnish
word cannot contain both grave and acute vowels. The formula holds only for
the abstract representation of the phonemes as it is embodied in the matrix, for
physically speaking [e] and [i] are both acute. In the construction of the
Finnish word, these two phonemes, however, do not behave like other acute
vowels. The formal requirements imposed on phonemes force us to treat [e]
and [i] as vowels which are neutral with regard to the feature grave-acute,
and indeed this is how these phonemes appear to be treated by the language.
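
The vowel harmony formula can be checked mechanically once [e] and [i] are left unspecified for grave-acute. The sketch below assumes the grave-acute values implied by the two harmony sets given above, with "0" for the neutral vowels; the names `GRAVE` and `obeys_harmony` are hypothetical.

```python
# Grave-acute values in the minimal matrix: "+" grave, "-" acute,
# "0" unspecified (the neutral vowels [e] and [i]).
GRAVE = {
    "æ": "-", "ö": "-", "ü": "-",
    "a": "+", "o": "+", "u": "+",
    "e": "0", "i": "0",
}


def obeys_harmony(vowels):
    """True if the vowels of a word do not mix grave and acute values."""
    specified = {GRAVE[v] for v in vowels} - {"0"}
    return len(specified) <= 1


if __name__ == "__main__":
    print(obeys_harmony(["a", "o", "e", "i"]))   # one harmony set
    print(obeys_harmony(["æ", "ö", "ü", "i"]))   # the other harmony set
    print(obeys_harmony(["a", "ö"]))             # mixed, not admitted
```

If [e] and [i] were instead marked acute, as they are physically, the same test would wrongly reject words combining them with [a], [o] or [u].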

The reasons advanced for reducing the number of specified features in the 
input instructions do not hold only in the case of phonemes. As we have 
seen in the discussion of the segment-sequences that are admitted at the be- 
ginning of an English word, under certain conditions not all features which 
must normally be specified in a phoneme serve to distinguish one morpheme 
from another. We have, however, not required that the input instructions 
consist entirely of phonemes. We can now take advantage of this and leave
unspecified in the input instructions all features that are not phonemic. The
rules of the grammar will insure in such cases that unspecified features are
specified so as to yield the correct phonetic consequences; i.e., possible English
utterances.

The question which we have as yet not discussed is at what point in the 
grammar must we place the various rules that reflect the constraints on feature 
combinations. At first sight it may appear desirable to place all of them at 
the end, after the operation of the transformational rules, since it is only at 
the end of the transformations that all grammatical operators (i.e., symbols
like « Past », « Plural », etc.) are converted into features or feature segments.
If we were to apply the above rules before the transformations it would be 
necessary either to apply the same rules again, in order to handle those feature 
segments that were introduced by the transformational rules, or to specify 
many more features in the output of the transformational rules. I shall now 
attempt to present reasons why it is necessary to apply some rules reflecting 
constraints on feature combinations before the transformations. 

Since it is always possible to add new words to the language the lists of 
morphemes must not be considered closed. The rules which reflect the con- 
straints on feature combinations do not enable us to develop a procedure for 
discovering the most economical distinctive feature representation for every 
morpheme; this can be found only by repeated trial and error. Consequently, 
it is not possible to predict a priori what types of distinctive feature columns 
will appear in the representations of the different morphemes, for it is con- 
ceivable that a new morpheme to be introduced in the future will require for 
its most economical representation a distinctive feature column that is not 
otherwise found in the language. 

The above fact has important consequences for the construction of the
grammar. We have just said in effect that we do not have a way for
determining what distinctive feature columns (segment-types) will appear in the
terminal strings after the application of the phrase structure rules. In many
languages, though perhaps not in all languages, there are certain
transformational rules which require that certain features be specified. As an example
consider the plural of the English noun « straw » [str'o]. As was shown at the
beginning of this lecture the features vocalic-nonvocalic and
consonantal-nonconsonantal would be represented in this morpheme as follows:


    vocalic        −   −   0   0
    consonantal    +   0   +   0


In other words, in the input instruction there would be no statement
regarding the nature of the last segment. In order to select the correct plural
ending for this noun, however, it is necessary to know its last segment (*).
This information is contained in the rules reflecting the constraints on feature
combinations; i.e., in rules MS1, MS2, MS3. It is necessary, therefore, to apply
these rules before the rule forming the plural of nouns, or more generally
before all transformational rules. I believe that the dividing line between the
rules that have to be applied before the transformations—let us call them the 
morpheme structure rules—and those that have to be applied after the trans- 
formations—let us call these the phonological rules—can be drawn by requiring 
that the application of the morpheme structure rules result in segment-types 
which are specified to a point where the entire set of segment-types is map- 
pable into a branching diagram in which each segment-type is represented 
by a distinct path through the diagram, all paths beginning at the initial node, 
but not necessarily ending in an end point. In other words, at this point the 
segment-types admitted in the representation are either phonemes or
segment-types which are « contained in » phonemes. We shall call the latter
segment-types archiphonemes. Since, however, the entire set must be mappable into
a branching diagram a feature specified in a phoneme can remain unspecified 
in an archiphoneme only if all features below it in the hierarchy established 
by the branching diagram also remain unspecified. 

Since the morpheme structure rules must be applied before the trans- 
formations, it is natural to include them in the phrase structure level rather 
than set up a separate linguistic level containing just these rules. The MS 
rules must, therefore, be of the same structure as other phrase structure rules; 
they must, e.g., not violate the restriction against rewriting more than one 
symbol in a single rule. They cannot result, therefore, in the elimination of
entire segments from the representation. Such rules, which are necessary in 
certain instances, will have to be included in another part of the grammar. 

All remaining rules dealing with constraints on feature combinations are
to be applied after the transformations. These rules differ from the
transformations in two significant respects: all the rules are obligatory,
i.e., require no external instructions to be put into operation; and the rules
do not require reference to other, earlier strings in the derivation. It is
therefore simplest to set up a special linguistic level containing only these rules. We call
this third linguistic level the phonological level. The rules of the phonological
level complete the specification of the phonetic properties of the utterance in 
so far as these are governed by the rules of the language. Phonetic properties 
whose actualization is left to the free will of the speaker are not specified by 


these or any other rules. They are beyond the purview of the science of lin- 
guistics. 


(*) The rule governing the selection of the plural endings in English is stated
at the beginning of Sect. 2.

It is well known how a simple and economical theory may transform an
empirical law from something quite amazing and difficult to believe into
something almost obvious and even trivial. It seems that such will be the final
fate of certain laws of linguistics: the relationship between rank and frequency
for natural words, and the relationship between species and genera in natural
taxonomies. Let us recall that these laws were discovered by J. B. ESTOUP
and J. C. WILLIS, respectively, but were made well known by the publications
of G. K. ZIPF [1]. The author's models for these results have been published since
1951, and their final form was given in 1957. Although these theories are
essentially very simple, we have not yet found a way of developing them fully
in a few pages. We shall therefore limit ourselves in this Note to a bare
outline of the theory of the frequency distribution for natural words.

The first main tool of the theory is the following relationship:

    $C = -\log_2 p$,

where p is the probability of occurrence of some signal in a message and C
is the « cost » of transmitting this signal in some optimal binary code. This
relationship is extremely familiar in information theory and may be obtained
under a wide variety of definitions of optimality; we shall not attempt here
to reduce this relationship to more fundamental concepts. Further, we shall
not restrict ourselves to binary codes, and shall write:

    (1)    $\beta C = -\log p$,

where $\beta$ is a factor which depends upon the scale chosen for C.
Let us apply the relationship (1) to the words of natural language. Each 
word will be labelled by the rank, which it occupies in a list of all words, 




STATISTICAL MACRO-LINGUISTICS 519 


arranged by order of decreasing probability in a given text: that is, r = 1
designates the most frequent word, r = 2, the second most frequent, etc.; the
number of words more frequent than a word of frequency p will be r(p) − 1.
Then, the empirical result is that for words other than the most frequent
ones (large r) one has, whichever the language in which a text was written:


    (2)    $p(r) = P r^{-B}$,

where P and B are some constants. The relationship (1) then becomes:

    $\beta C = -\log P + B \log r$,

    $\log r = \frac{\log P}{B} + \frac{\beta}{B} C = \log K + \beta' C$

(by definition of K and $\beta'$); finally,

    (3)    $r = K \exp[\beta' C]$.
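
The step from (1) and (2) to (3) can be checked numerically. The sketch below is an illustration under stated assumptions: natural logarithms are used throughout, the law (2) is taken in the power-law form p(r) = P r^(-B), and the constants in the usage lines are arbitrary values, not data from the text.

```python
import math


def cost_from_rank(r, P, B, beta):
    """Cost C of the word of rank r, from (2) followed by (1)."""
    p = P * r ** (-B)            # assumed form of law (2)
    return -math.log(p) / beta   # relationship (1), natural logarithm


def rank_from_cost(cost, P, B, beta):
    """Rank r as a function of cost, relationship (3)."""
    K = P ** (1.0 / B)           # log K = (log P) / B
    beta_prime = beta / B        # definition of beta'
    return K * math.exp(beta_prime * cost)


if __name__ == "__main__":
    P, B, beta = 0.1, 1.2, 1.0   # arbitrary illustrative constants
    for r in (10.0, 100.0, 1000.0):
        c = cost_from_rank(r, P, B, beta)
        print(r, round(c, 4), round(rank_from_cost(c, P, B, beta), 4))
```

Recovering the original rank from the computed cost confirms that (3) is just (1) and (2) solved for r.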


An «explanation » of the law of Zipf requires an interpretation of the 
« cost » of coding a word and a model for the structure of the word, which 
together would lead to (3). One reasonable interpretation of « cost» would 
be the number of letters required for the code. It turns out actually that 
this interpretation cannot be carried to the end, and one must rather think 
of the cost as being something like the time required to read a word [2]. How- 
ever we shall sketch a theory based upon the identification of cost to (essen- 
tially) the number of letters. The second step is the choice of the rule of 
formation of words: in the present model, one will assume that a word is any 
sequence of letters contained between successive occurrences of some addi- 
tional improper letter, the « space ». 

It is then reasonable to interpret cost as being equal to the number of
proper letters, plus the cost $C_0$ of the improper letter « space ». Let there be
M different proper letters. Then

    there is 1 word of cost $C_0$,

    there are M words of cost $C_0 + 1$,

    there are $M^2$ words of cost $C_0 + 2$, etc.

Adding, one finds that there are

    $\frac{M^n - 1}{M - 1}$

words of cost less than $C = C_0 + n$. For large n, this gives

    $r = K' M^{C - C_0} = K \exp[C \log M]$,

which is of the form required to explain Zipf's data on word frequency for large r.
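
The counting argument can be verified by direct enumeration. This is a minimal sketch assuming that a word of k proper letters has cost C_0 + k, so that the words of cost less than C_0 + n are those with 0 to n - 1 letters; the function names are hypothetical.

```python
def words_cheaper_than(n: int, M: int) -> int:
    """Number of words with fewer than n proper letters (cost < C0 + n)."""
    return sum(M ** k for k in range(n))


def closed_form(n: int, M: int) -> int:
    """Sum of the geometric series 1 + M + ... + M**(n - 1)."""
    return (M ** n - 1) // (M - 1)


def large_n_approximation(n: int, M: int) -> float:
    """Leading term M**n / (M - 1), i.e. K' * M**(C - C0) with K' = 1/(M - 1)."""
    return M ** n / (M - 1)


if __name__ == "__main__":
    M = 26
    for n in (1, 2, 5, 10):
        exact = words_cheaper_than(n, M)
        print(n, exact, exact / large_n_approximation(n, M))
```

The ratio of the exact count to the leading term tends to 1 as n grows, so the rank grows exponentially with cost, as required by (3).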

It is unfortunate that the simplest case above cannot be carried out to 
further steps without some difficulties. However, it turns out that the same 
result (3) can be obtained under wider, more realistic and mathematically 
more convenient conditions, as long as a word is a sequence of letters contained 
between two successive spaces, as long as there is «little interaction » between suc- 
cessive proper letters, and as soon as one can justify (1). 

A closer examination of the cost of coding for small values of r suggests
the following improvement of the law (2), valid for all r,

    $p(r) = (B - 1) V^{B-1} (r + V)^{-B}$,

where V is a second coefficient. This further approximation turns out to be
experimentally excellent (*).
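
The relation between the improved law and the simple power law can be seen numerically. The sketch assumes that the improved law reads p(r) = (B - 1) V^(B-1) (r + V)^(-B); the constants used are arbitrary illustrative values and the function name `improved_law` is hypothetical.

```python
def improved_law(r: float, B: float, V: float) -> float:
    """Assumed form of the improved rank-frequency law."""
    return (B - 1) * V ** (B - 1) * (r + V) ** (-B)


def power_law(r: float, B: float, V: float) -> float:
    """Large-r limit P * r**(-B), with P = (B - 1) * V**(B - 1)."""
    return (B - 1) * V ** (B - 1) * r ** (-B)


if __name__ == "__main__":
    B, V = 1.5, 5.0  # arbitrary illustrative constants
    for r in (1.0, 10.0, 1000.0, 1e6):
        print(r, improved_law(r, B, V) / power_law(r, B, V))
```

For ranks much larger than V the two expressions agree, while for the most frequent words the coefficient V flattens the curve.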


(*) Other explanations may eventually be given of the rank frequency relation (2). 
However, the explanation suggested by H. A. Simon is certainly incorrect. See our 
Note on a class of skew distribution functions, in Information and Control, 2, 90 (1959).

1. — Introduction.


Neuroanatomy is what is left when all the problems about the neuron are 
solved. Most of these problems, however, are still unsolved and much con- 
fusion has been generated through the assumption of clear-cut findings where 
only hypotheses were intended. The diagram below (the McCULLOCH and
Pitts abstraction) stands for the logical expression (in Hilbert and Ackermann 
notation): 


    $E \sim (A B \bar{D}) \vee (A C \bar{D}) \vee (B C \bar{D})$.


It has but a very tenuous relation to neurons. Nobody has ever seen,
presumably, the complete anatomical structure representing a junction such as
A-E or D-E in the case of two real nerve cells, and nobody knows the exact
difference between junctions of the excitatory (A-E, B-E, C-E) and of the
inhibitory kind (D-E). Also, it is very doubtful whether chains of events in
a nervous system can always be expressed by enumerating neurons which in
an « all or none » fashion fired in succession. Ionic changes taking place in
limited portions of the neurons, and such changes as cannot be described in
either of the categories of the « firing » or « not firing » of a neuron, may
play a much more remarkable role than would appear from
the first approximation models.
Thus it is not at all sure that the
two branches E′ and E″ in Fig. 1

Fig. 1.


(*) The present research has been sponsored in part by the European Office of the 
Air Research and Development Command, U.S. Air Force.

will always « fire » when the axon E fires, etc. There are no experiments to
indicate the exact extent to which neurons influence each other outside of 
the «official » junctions. Whether activity in dendrites can produce activity 
in neighbouring dendrites directly and what sort of changes mediate this inter- 
action, is just as obscure as whether similar interactions occur among axons 
in a parallel bundle. 
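
What the McCulloch and Pitts abstraction asserts can be made concrete with a minimal sketch. It assumes that the logical expression above is to be read with D negated in every term (as its inhibitory role suggests), that the unit has a threshold of two excitatory inputs, and that D acts as an absolute veto. The function names are hypothetical and the fragment says nothing about real neurons.

```python
from itertools import product


def mp_unit(a, b, c, d, threshold=2):
    """Formal threshold unit: fires when at least `threshold` of the
    excitatory inputs A, B, C are active and the inhibitory input D is silent."""
    if d:  # assumed absolute inhibition
        return False
    return (a + b + c) >= threshold


def logical_expression(a, b, c, d):
    """Assumed reading of the expression: (A B not-D) or (A C not-D) or (B C not-D)."""
    return (a and b and not d) or (a and c and not d) or (b and c and not d)


# Compare the two descriptions over all 16 input constellations.
agree = all(
    mp_unit(*inputs) == logical_expression(*inputs)
    for inputs in product([False, True], repeat=4)
)
print(agree)
```

Under these assumptions the threshold unit and the logical expression are two descriptions of the same truth table; the doubts raised in the text concern whether either of them describes a nerve cell.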


2. — Methods. 


Evidence about the texture of the brain is derived from the following funda- 
mental methods of investigation: 


a) Fiber tracing: The fact that the nervous system is largely built out
of long fibers was obvious to old anatomists who investigated slightly
macerated brains with the aid of wooden sticks and skilful dissection. Various
phenomena which proved to remain limited to fibers or fiber bundles, without
spreading at right angles to the direction of the fibers, helped to identify
pathways. Such phenomena were the degeneration, i.e. anatomical disruption, of
entire fibers following the experimental destruction of parts of the fibers,
and the possibility of recording localized electric potentials after localized
stimulation as evidence of existing fiber connections between the site of
recording and the site of stimulation.

Theories elaborated by workers who used these techniques are sustained 
by the hopeful assumption that the knowledge of the connections between 
input stations, elaborating stations and output stations would in itself provide 
an answer to most problems. The aspects completely neglected are those re- 
garding the junctions, the conditions for the propagation of signals in knots, 
i.e. places where fibers bifurcate or are confluent. 


b) Golgi method: A very odd chemical reaction between a chromate
solution, with which pieces of brain have been impregnated, and silver nitrate,
into which these pieces are subsequently placed, may produce nuclei of
precipitate inside a cell which grow continuously into all its ramifications
without ever trespassing onto the outside or onto neighbouring cells. This
produces an unpredictable selection of very few neurons out of a large neuron
population, with the inestimable advantage of a « histochemical dissection » of
the complete ramifications of the cell, which would of course remain
inextricable if all neurons of the net were stained.

The theoretical bias produced by this view of the nerve nets was an
overestimation of the importance of minute differences of nerve cell shape,
while the statistical distribution of the different types was largely
neglected. The neuron theory, with the dogmatic assumption of the unit
character of nerve cells, was another outcome.
c) Aniline dyes: It is possible to stain the « cell bodies » of neurons
selectively without staining any of the long branches and fibers, thus likening
neurons to other cells of the organism. This technique has the advantage of
completeness (it does not leave out any cells) and is the basis of cell counts,
since it is obviously easier to count small distinct dots than closely packed
stars or brushes. The importance of this method has been grossly
overestimated, and the quotation « nil bonum e Nissl » becomes understandable
in view of the fact that these methods (the « Nissl methods ») leave us
completely in the dark both about the junctions and the pathways themselves.


d) Fiber staining: Certain staining methods reveal remarkable local dif- 
ferences in different feltworks of nerve fibers. The intensity of the stain proves 
to coincide generally with the presence of long and thick fibers, whilst an 
equally dense feltwork of thin and short fibers stains very lightly. This
phenomenon is due to a fatty substance, myelin, which covers the thicker fibers
and in all probability performs the function of an insulator. The study of
nerve nets with the aid of this method has not given all the results it may 
yet give. Under the general assumption that contacts between fibers are made 
everywhere except in the insulated portions, the presence and the orientation 
of long insulated fast conducting (see below, Sect. 3) segments reveals the 
most important exceptions to the randomness of a net. 


3. — The basic element. 


In view of the uncertainty of the translation of the neurological reality 
into logical diagrams (Sect. 1) the basic element should be carefully defined 
and should not be burdened with too many unproven assumptions. 


a) The basic element of nerve nets is the insulated nerve fiber. It varies
in length between 10 µm and over 1 m, and in thickness between about 1/10 µm
and over 20 µm.

The operation performed by such a fiber is the transmission of an event 
from one end to the other without interference with other events in other 
fibers. 


b) The direction of transmission is fixed for each fiber. 


c) The velocity of the transmission varies between a few cm/s and over 
100 m/s. It is fixed for each fiber and varies directly with the thickness of 
the fiber (the thicker the faster). 


d) The transmission of an event is a consequence of the occurrence of 
« excitation » within a region, called pickup-field, associated with each fiber. 
e) The transmission of an event produces « excitation » within a region,
called field of excitation, associated with each fiber. Field of excitation and
pickup field of the same or of different fibers may overlap.


The phenomena called «event» and «excitation» are complex physico-chemical
changes. It is convenient to treat the event which is transmitted
in nerve fibers as a binary signal and to relegate to the intervening excitation
those effects which appear as monotonic (« facilitation ») or non monotonic
(« inhibition ») functions of the number of active fibers.

The places where the quantity of excitation determines the transmission or
non-transmission of an event in fibers are the knots of a net. Note that,
according to the definitions given, the knots include, besides the « synaptic
junctions » between neurons, also branching points of axons (which are generally
uninsulated) and, generally, uninsulated segments. In practice (Sect. 5 b, c, d)
we shall have to revert to the conventional identification of fibers with
« neurons » due to the lack of precise physiological data on the interactions
outside of the synaptic regions.


4. — Some simple organs composed of fibers. 


a) Bundles. Large parts of all brains are composed of bundles of pa- 
rallel fibers. The number of fibers in a bundle is mostly of the order 10° up 
to 10’, smaller bundles (10--100 fibers) being rare and larger assemblies being 
mostly composites in no way describable as uniform organs (example, white 
matter of the hemispheres). 

All fibers in a bundle transmit events in the same direction. If fibers of 
opposite direction are found in what appears macroscopically as one bundle, 
we prefer to speak of two bundles. 

The operation performed by a bundle of parallel fibers is the translation 
of patterns of excitation and the production of a delay. 


b) Branching bundles. Microscopical examination of fiber masses in the brain
reveals places where all fibers of a bundle bifurcate, giving rise to two
secondary bundles. When the direction of the fibers is known, it is always
apparent that only one of the three branches of such a knot is afferent, i.e.
conducts towards the knot (Fig. 2).

Fig. 2.

In the simplest case an organ composed of parallel, branching fibers per- 
forms the operations of translation, multiplication and delay of patterns of 
excitation. 

In some cases such bifurcations will however be equivalent to the endpoints 
of fibers in the sense defined in Sect. 3. In other words, excitation produced 
by neighbouring fibers may determine the transmission or non-transmission
of an event beyond a bifurcation. The classical example of this is given by
the experiments of BARRON and MATTHEWS (1935).


c) Loops or delay organs. If we eliminate the factor « translation » from
the operation of a bundle by bending it into a closed ring and bringing the
two endpoints of each fiber together in close proximity (Fig. 3) we obtain an
organ capable of producing a fixed delay in a pattern of excitation. There is
no evidence that such an organ exists in natural brains.

Fig. 3.


d) Tapering bundles. The simplest relation between a bundle of fibers
and some other organ (see Sect. 5, the griseum) is that of a one to one
correspondence of fibers and subdivisions of that organ. The fibers may be
afferent or efferent or mixed. The anatomical expression of this relation is a
tapering bundle (Fig. 4), in the simplest case of a one dimensional array, a
linear fall etc. Numerous examples could be quoted (« fimbria of the
hippocampus », « optic fibers in lower vertebrate tectum », « bundles in the
lateral thalamus » etc.).

Fig. 4.


e) Bilaterally tapering bundles on both sides of a plane of mirror sym- 
metry in the nervous system (the median sagittal plane in most animals) indi- 
cate the presence of a so-called commissure, i.e. of a one to one connection 
via fibers of symmetrical points of the two halves of the nervous system 
(Fig. 5). This is a frequent pattern in vertebrate brains. 
Fig. 5. Fig. 6.


f) Not strictly speaking a bundle, but a system of fibers connecting 
(in the one dimensional array) all subdivisions of an organ a certain distance 
apart (Fig. 6) would have uniform thickness in its middle part (provided that 
the length of the total array is more than twice that distance) and would be 
seen tapering at both ends. There are examples for this in the cerebral cortex 
(horizontal fibers in layer IV of the striate area). 


g) A system of fibers connecting each subdivision with every other
subdivision of an organ, in the one dimensional array would have the shape
given by $ni - n^2$ (number of fibers in the $n$-th of $i$ subdivisions)
(Fig. 7). Examples of this may also possibly be found in certain arrays of
fibers within the cerebral cortex (example: inner stria of Baillarger in the
primary acoustic area on Heschl's convolution).

Fig. 7.
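
The two thickness profiles described in f) and g) can be checked by direct counting. The sketch below assumes a one dimensional array of i subdivisions numbered 1 to i and counts the fibers that cross the boundary just after subdivision n. For the all-to-all system this count is n(i - n) = ni - n², which is the reading of the formula in g) assumed here. The function names and the boundary-counting convention are illustrative assumptions, not taken from the text.

```python
def crossings_all_to_all(i, n):
    """Fibers crossing the boundary after subdivision n when every
    subdivision is connected with every other one (case g)."""
    return sum(
        1
        for p in range(1, i + 1)
        for q in range(p + 1, i + 1)
        if p <= n < q
    )


def crossings_fixed_distance(i, d, n):
    """Fibers crossing the boundary after subdivision n when only
    subdivisions exactly d apart are connected (case f)."""
    return sum(1 for p in range(1, i - d + 1) if p <= n < p + d)


i = 10
profile_g = [crossings_all_to_all(i, n) for n in range(1, i)]
formula_g = [n * i - n * n for n in range(1, i)]
profile_f = [crossings_fixed_distance(i, 3, n) for n in range(1, i)]

print(profile_g)  # rises to a maximum in the middle and falls off symmetrically
print(profile_f)  # flat in the middle, tapering at both ends
```

With i = 10 and a connection distance of 3 the second profile is flat over the middle of the array and tapers at both ends, which is the condition stated in f) that the array be more than twice as long as the connection distance.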


h) If two halves of a surface are connected with the corresponding halves
of another, parallel surface through two crossed bundles, each composed of
parallel fibers, the awkward splitting of the pattern of excitation which is
produced by this arrangement becomes understandable only in connection with
the analogous splitting produced by the two lenses of two separate eyes whose
visual fields are not overlapping. The crossing of the optic nerves annuls the
disruption of continuity produced by the two separate optic receivers. The
crossing of the vast majority of long, bilaterally symmetrical bundles in
vertebrate brains is said to be secondary to the crossing of the optic
connections.

Fig. 8.
i) Bundles composed of fibers of different thickness and therefore different
velocity of conduction (Sect. 3-c) produce a temporal scattering of the patterns
of excitation transmitted. This could be an essential prerequisite for
many mechanisms which can be theoretically postulated. Again, as in the 
case of loops of fibers as delay organs (Sect. 4-c) we are ignorant about the 
physiological significance of such an arrangement. The many existing col- 
lections of fibers of varying thickness may also be interpreted as mixtures 
of different bundles each with uniform fiber thickness. 
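
A rough numerical illustration of this temporal scattering follows. It assumes that conduction velocity is strictly proportional to fiber diameter, with a purely illustrative constant of 6 m/s per µm of diameter (a hypothetical value, chosen only so that a 20 µm fiber conducts at somewhat over 100 m/s, in line with Sect. 3-c). The function names, bundle length and diameters are likewise assumptions.

```python
def arrival_times_ms(length_mm, diameters_um, speed_per_um=6.0):
    """Delay of each fiber in ms, assuming velocity (m/s) = speed_per_um * diameter (um).
    A length in mm divided by a velocity in m/s gives a time in ms."""
    return [length_mm / (speed_per_um * d) for d in diameters_um]


def temporal_scatter_ms(length_mm, diameters_um, speed_per_um=6.0):
    """Spread between the earliest and the latest arrival at the far end of the bundle."""
    times = arrival_times_ms(length_mm, diameters_um, speed_per_um)
    return max(times) - min(times)


# A 100 mm bundle of mixed fiber thickness versus one of uniform thickness.
mixed = temporal_scatter_ms(100.0, [1.0, 2.0, 5.0, 10.0, 20.0])
uniform = temporal_scatter_ms(100.0, [10.0] * 5)
print(mixed, uniform)
```

Under these assumptions a pattern that enters the mixed bundle simultaneously leaves it spread over several milliseconds, whereas the uniform bundle only delays it.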


5. — The griseum. 


A survey of existing nerve structures shows the general rule that, as insu- 
lated fibers tend to conglomerate in parallel bundles, the non insulated extremes 
of fibers tend to lie in delimited regions. This allows the extremes of many 
fibers to come into close non-insulated propinquity. A general distinction of 
two types of regions or of «substance » within the brain can be made, the 
« white substance » or album containing insulated fibers, and the « grey
substance » or griseum containing non insulated extremes of fibers. The
difference in shade expressed in the two denominations stems from the greater
amount of whitish fatty insulating material (« myelin ») in the album. 

The griseum is very varied in its fine structure and may be classified in
various ways. We abstract some general characteristics concerning the relation
between fibers and griseum on one hand, and some characteristics of
symmetry on the other, in order to arrive at a rough classification of the
griseum.
a) The ratio grey/white. Under the assumption that the structure of a
grey organ corresponds in coarseness to the coarseness of the fiber bundles
connected with it, i.e. that the density of points capable of distinct states in
a cross-section of fiber bundle is of the order of the density of such
functionally distinct points in the griseum, the volume of a grey organ
compared to the cross-sectional surface of all fiber bundles connected with it
represents a measure of the complexity of the operations performed. This
measure, which is available for many grey organs existing in nature, is
generally referred to as the « grey/white index ».


b) Total convergence or divergence. Considering a grey organ with distinct
incoming and outgoing fibers, the relative size of the two bundles may
indicate convergence or divergence. A complete classification machine (UTTLEY),
classifying all possible constellations of activity in n input fibers, would
show great divergence ($n : 2^n$). A scanning device capable of transforming
some spatial array of excitation into a temporal sequence would show
convergence etc.
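
The counting argument behind the n : 2^n divergence can be sketched as follows. The fragment assumes that each of the n input fibers is either active or silent and that the machine reserves one output fiber for every possible constellation. It is a plain one-out-of-2^n decoder written for illustration, not a description of Uttley's actual design, and the names are hypothetical.

```python
from itertools import product


def classify(pattern):
    """Index of the single output fiber assigned to a binary input pattern
    (the pattern read as a binary number, first fiber most significant)."""
    index = 0
    for bit in pattern:
        index = 2 * index + int(bit)
    return index


n = 4
outputs = {classify(p) for p in product([0, 1], repeat=n)}
print(n, len(outputs))  # number of input fibers, number of output fibers needed
```

Each additional input fiber doubles the number of output fibers needed, which is why the outgoing bundle of such a machine would be so much larger than the incoming one.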


c) The influence which incoming fibers exert on outgoing fibers of the
same griseum may be direct, or mediated through other fibers (internal fibers). 
These interactions vary according to the size, shape and relative position of 
pickup and excitation fields of the various fibers involved. 


Pickup fields. Although for any specific fiber the locus of the points in
which excitation may determine an event can be described anatomically as
the surface of the corresponding « somatodendritic tree », the distribution of
pickup points provided by such trees is more conveniently described as a
density field around the endpoint of the fiber. This field may be approximated
by some exponential function of distance, as in the simplified example
considered by SHOLL (1953). The size and outline of such pickup fields can be
obtained through direct histological observation of the griseum. The size
varies for different grey organs and for different types of fibers, the
variation (in human brains) being contained between a diameter of about 10 µm
and that of about 3 mm.

Fig. 9.

The larger pickup fields may be overlapping with as many as 10°--10* 
other pickup fields. Some (Purkinje cells of the cerebellar cortex) never overlap. 

The outline (Fig. 9) is spherical (A), or that of a flat box (B) (Purkinje
cells), or, most frequently in the cerebral cortex, comparable to a very
elongated pear, or a round bottle with a long neck (C).
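
One possible way of writing the exponential approximation of a pickup field mentioned above is given here as a sketch. The symbols are assumptions introduced for illustration: $\rho(r)$ is the density of pickup points at distance $r$ from the endpoint of the fiber, $\rho_0$ the density at the endpoint and $\lambda$ a characteristic length which would have to be of the order of the field sizes quoted above.

```latex
\rho(r) = \rho_0 \, e^{-r/\lambda}, \qquad r \ge 0 .
```

Such a form describes only the spherical outline; the flat and the elongated outlines would require a density that depends on direction as well as on distance.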
d) Fields of excitation. If we take the terminal arborization of an axon
as the anatomical expression of the corresponding field of excitation, we find
that the outline and size of these fields is even more varied in different grey
organs than the outline and size of pickup fields. The complete field of
excitation corresponding to one fiber may be composed of several separate
regions, due to the possibility of branching of fibers (Sect. 4-b). Some
extreme variations are the following (Fig. 10):


Fig. 10. 


Field of excitation coinciding nearly with pickup field of the same fiber 

(example: granula of the cerebral cortex) (A); 

field of excitation far removed from the corresponding pickup field 
(example: Betz cells of the cerebral cortex) (B); 

very dense branching within a narrow region (example: « specific
afferents », i.e. sensory incoming fibers of the cerebral cortex);

diffuse branching over large regions (example: reticulo-cortical neurons);

long straight unbranched and uninsulated fibers, producing considerable
temporal scattering in a linear field of excitation (example: « parallel fibers »
of the cerebellar cortex) (C);

chains of separate fields of excitation (example: « basket cells » of the 
cerebellar cortex) (D). 


e) Internal fibers. An internal fiber of a grey organ is one that remains 
entirely within that organ, or in other words, a fiber which is neither afferent 
nor efferent. The number of such internal fibers varies greatly in different 
regions of the griseum. They may be missing altogether (examples: « dentate 
nucleus of the cerebellum », many grey organs of the «brain stem ») or cor- 
respond roughly in number to the number of afferent or of efferent fibers 
(example: «retina ») or outnumber greatly the external fibers (example: cere- 
bellar cortex, where the number of « granula » is about 3000 times the number 
of « Purkinje cells », the only efferent fibers). The preponderance of the
internal fibers is evidently related to the grey/white index (Sect. 5-a).
The system of internal fibers may be arranged anatomically «in series »
with the external fibers (e.g. retina). In the simplest case this arrangement 
would simply iterate the transformation which a pattern of excitation under- 
goes in passing through the region of interlocking pickup fields and fields of 
excitation. 

In some instances (example: « granular layer» of the cerebral cortex) the 
internal fibers form a closed assembly of very small and very numerous elements
of the type of Fig. 10 A, among the much larger external fibers of the
same grey organ. It is possible that this provides for a « quasi analogue » 
mechanism capable of smoothing out the more « digital » operation of the larger 
elements. 

Finally there exists, in many grey organs and particularly in the cerebral 
cortex, a dense population of relatively thick (fast conducting) long internal 
fibers, spanning thousands of the above mentioned granula. This population 
varies in density and arrangement in different regions of the same grey organ 
(sometimes showing the patterns described in Sect. 4-e,f,g) and promises to 
give important clues about the variations of the structure of the net and their 
relation to the operations performed. 


6. — Symmetry of the griseum.


The fine structure of the nerve net may be uniform throughout a grey 
organ, or may vary from point to point. Moreover, the statistics of the meshes
of the net may or may not reveal directional differences. Accordingly, histo- 
logical sections from the same grey organ may show quite different pictures 
depending on the direction of the cut, and on the region from which the sample 
was taken. 

A classification of the grey organs can be based on the class of possible 
different cuts which give indistinguishable histological preparations. Such a 
classification promises to be closely related to a fundamental classification 
of different types of transformation of patterns of excitation within the 
brain. 

The considerable randomness which makes nerve nets different from crystals 
does not disturb us if we adopt the operational criterion of « distinguishable 
sections ». Thus «randomness » becomes equivalent to «no pattern », and the 
classification is based on characteristics which clearly emerge from the back- 
ground of randomness. With this premise we may use the language of « sym- 
metry » in order to define different types of nerve nets. The symmetry of a 
nerve net is defined by the group of the translations, rotations and reflections 
of the plane of the section which produce indistinguishable histological
preparations.


34 - Supplemento al Nuovo Cimento. 
a) Full symmetry: random nets. The statistics of the connections between
two points of a random net is entirely defined by their distance. Hence any 
cut looks like any other cut and we may apply the concept of «full sym- 
metry possessed by space itself » (WEYL). Many small grey organs of all brains 
are of this type. In the larger masses however we can always detect some 
degree of directional organization. 


b) Full symmetry of the plane: random plane. The largest portion of the
griseum in the human brain (progressively smaller as one descends the animal
scale) is of this type. Cuts parallel to one direction (the « thickness ») are
indistinguishable. In the direction of the thickness the statistics of the
elements varies asymmetrically, i.e. there exists a non repetitive
stratification (Fig. 11).

This type of griseum is called cortex. It appears generally in the form of
large and relatively thin sheets whose extension may be 100 to 200 times their
thickness. Afferent and efferent fibers are usually related to two different
levels of the thickness-stratification, which can therefore be recognized as
the anatomical substrate of the transformation performed by the organ.

Fig. 11.


c) Double translatory symmetry in the plane. (Cortex with lattice
structure). This type of symmetry is found in a portion of griseum which in all
vertebrates is devoted to equilibrium and to the fine regulation of movement:
the cerebellar cortex (Fig. 12). There are two entirely different sets of
connections in two perpendicular directions of the plane, as well as a
periodicity (about 50 µm in one direction and 200 µm in the other) which
defines a clear lattice type symmetry. The direction perpendicular to the
plane of the lattice is again asymmetrically organized. Even more striking than
in the previous type is the sheet like expansion of the « cerebellar cortex »,
where the maximum extension is about 1 000 to 2 000 times the thickness.


d) Translatory symmetry in one direction (the symmetry of a band orna- 
ment) characterizes the spinal cord, which is subdivided into identical segments, 
and where to a good physiological approximation the influence exerted by 
activity in one segment onto the next segment is always the same.


e) Translation and rotation (through 90°) characterize the anatomical
relation between two successive chiasmas and two successive cortices in the
eye stalk of crustaceans.
f) Finally, it might be useful to mention some types of symmetry which
do not occur in animal nervous systems. Thus the full symmetry of the plane,
associated with translatory symmetry in the direction perpendicular to it 
characterizes a voltaic pile, but not any existing nerve net. Similarly, there 
are no organs with rotational symmetry, with the exception of the very loose 
and quite primitive nerve nets of animals whose body as a whole is of this 
morphological type (example: starfish, medusa). 
My plan is to concentrate exclusively on one area of investigation, viz.
the relationships between reticular formation of the brain stem and the 
transmission of visual messages to the cerebral cortex. There will be a 
further limit to the scope of the present review, since the problem of the 
interrelations between specific and diffuse projection systems will be approached 
from the angle of visual habituation. This drastic fence building is made 
necessary by time limits. 

As an introduction, I should like to recall briefly the functional anatomy 
of the visual system. 


1) The story begins with the receptor-transducers, i.e. with cells lying in
the so-called sensory organs. They are endowed with the ability to transform
physical or chemical stimuli into nerve messages. The rods and the cones of
the retina are the visual receptors. They transform photic stimuli into visual
messages. There are about 125 000 000 rods and cones in the human retina.


2) We are not concerned here with how the rods and the cones give
rise to the nerve impulses constituting the visual messages. Let us state that
the nerve impulses are conducted through two other types of retinal cells, the
bipolar cells and the ganglion cells, to the optic nerve. There are about
1 250 000 fibers in each human optic nerve.


3) Each nerve impulse is a potential oscillation, whose size and shape 
is constant for a given nerve fiber and which is conducted at a constant speed 
in each nerve fiber. Hence any visual information which is conveyed to the 
brain may be characterized only by the patterns of spatial and temporal arran- 
gement of the impulses coursing along the optic nerve. 
4) Again we are not interested in the anatomical details of the cerebral
projections of the visual fibers. It is enough to state that the impulses are 
relayed by the nerve cells of the lateral geniculate body to the visual area, 
which is located in the occipital lobes of the cerebral hemispheres. The visual 
system we have outlined so far is classified as a specific projection system,
and the visual cortex is regarded as one of the specific projection areas. The
adjective « specific» means that all the structures considered above, and in 
particular the cortical projection area, subserve exclusively visual functions. 


5) From the specific projection area the impulses may be transmitted 
to other areas of the cerebral cortex, whose function is to elaborate and to store 
the visual information. 
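
The two counts quoted in points 1) and 2) imply a considerable convergence within the retina itself. As a worked figure, using only the approximate numbers given above and assuming they refer to one eye and its optic nerve:

```latex
\frac{1.25 \times 10^{8}\ \text{receptors}}{1.25 \times 10^{6}\ \text{fibers}} = 100
\quad \text{receptors per optic nerve fiber, on average.}
```

This is an average only; it says nothing about how the convergence is distributed over the retina.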


I hasten to emphasize that this scheme has by no means been falsified by 
the new developments that have occurred in neurophysiology during the last
ten years. There is little doubt, however, that it must be completed with the
results of more recent experiments (Rossi and ZANCHETTI, 1957). 

First of all, we have recently learnt that the specific systems are not the
only channels through which the sensory organs may influence the brain.
The visual impulses exert a generalized influence all over the cerebral cortex 
through an entirely different system, the ascending reticular system. This 
system takes its origin in the medulla and extends upward through the pons, 
and the midbrain. It receives impulses from the collaterals of all kinds of 
sensory pathways and persistently conveys the neural messages upward, where 
a diffuse system distributes them widely upon the cortex. One role of this 
ascending reticular system is arousal and the maintenance of the waking state. 
When this system is interrupted at midbrain levels the animal is deeply
somnolent and can be aroused only barely and momentarily by intense stimuli.
In this animal the visual impulses will still reach their specific projection
areas in the cerebral cortex. However, the animal is apparently unable to make
visual discriminations or to recognize visual patterns.

Summing up, perceptual integration requires a close collaboration between 
the classical specific projection pathways and the ascending reticular system. 
Our first task will be to try to understand how such integration is made pos- 
sible. 

We are now coming to grips with a second concept, which we owe to the
discoverer of electroencephalography, HANS BERGER. The importance of
this discovery, which was made 30 years ago (1929), can hardly be over-em- 
phasized, since it gave the demonstration that the neurons of the cerebral 
cortex are spontaneously active also when they are supposed to be at rest, 
such as during sleep or mental relaxation. The EEG, whether recorded from 
man or from animals, shows continuous fluctuating potentials at the surface 
of the head. Its most outstanding component is the alpha rhythm, which is 
characterized by a frequency of around 10 cycles per second. An important
property of the alpha rhythm is its disappearance when the
eyes are opened. The co-ordinated pulsation of large groups of neurons, which
is responsible for the « waves », is supposed to be thrown out of synchrony in
these experimental conditions. This disappearance of high voltage slow waves
can be easily reproduced in all mammals whenever they are aroused by a
startling sensory stimulation. It is called EEG arousal or arousal reaction
and is observed throughout the dorsal extent of the cerebral cortex; it is
probably mediated by the ascending reticular system (MORUZZI and MAGOUN,
1949).

It should be emphasized that all the new developments in neurophysiology 
would never have been possible without BERGER's discovery. Let us consider
again the visual system. From the retina up to the projection areas of the 
visual cortex the central nervous system was once thought to behave pas- 
sively, i.e. one believed that the stimulus characteristics were simply trans- 
formed into spatial and temporal patterns of visual messages reaching the 
occipital lobes of the cerebral cortex. BERGER’S discovery and the findings 
which were prompted by it have definitely shown that the cerebral cortex 
is an active, dynamic mechanism whose responsiveness is continuously con- 
trolled by nervous structures lying in the reticular formation of the brain stem 
and in the midline nuclei of the thalamus. Indeed we have learnt that the 
retina itself is active in complete darkness and it is very likely that its activity 
and its responsiveness to the physical stimuli are steadily controlled by the 
central nervous system through efferent fibers coursing in the optic nerve. 
Thus sensory stimuli do not elicit, but simply modulate, an activity which 
would go on spontaneously in their absence. 

The main conclusion to be drawn from all these considerations will be that 
a given photic stimulus may evoke quite different responses in the visual cortex 
according to the background activity of the visual cortex itself, the presence 
or the absence of other sensory stimulations, or any internal condition like
emotions, bodily needs, and so on. This conclusion, after all, fits very nicely
what we know from sheer introspection. Everybody knows that, for a given
intensity of physical stimulus, our sensation may be increased by attention
or decreased by habituation. It is possible, however, to substantiate this
assumption with objective, electrophysiological methods.

Whenever a volley of visual impulses impinges upon the specific projection 
area of the cerebral cortex a potential oscillation may be led from its surface. 
It is called «evoked potential» and is mainly related to the response of the 
neurons of the visual cortex to the afferent volley. Its size is usually regarded 
as a rough index of the number of the cortical units responding to the visual 
messages, i.e. of the intensity of the sensory response.

We shall be concerned here only with the changes of the photically evoked 
potentials during habituation. Habituation has been defined as «a process
whereby certain sensory stimuli by repeated application lose significance to
the individual». Everyday experience indicates that this phenomenon is ac- 
companied by a decreased awareness of the corresponding sensory stimulus. 

HERNANDEZ-PEON (1955) has recently quoted his unpublished experiments 
with GUZMAN-FLORES and ALCARAZ, showing that the evoked potential recorded
in the subcortical visual relay, the lateral geniculate body, is decreased
during habituation. He has reported, moreover, that the introduction of an 
auditory stimulus will abolish visual habituation (dishabituation). Quite re- 
cently CAVAGGIONI, GIANNELLI and SANTIBANEZ (1959) have confirmed these 
findings. They have observed, moreover, that the decline of the evoked res- 
ponse starts earlier in the visual cortex than in the corresponding lateral ge- 
niculate body; in other words, habituation occurs in the specific projection 
areas of the cerebral cortex when the subcortical relays are not yet habituated 
to the incoming volleys coursing along the optic nerve. 

THORPE (1950) has defined habituation as «an activity of the central ner- 
vous system whereby innate responses to certain relatively simple stimuli, 
especially those of potential value as warning of danger, wane as the stimuli 
continue for a long period without unfavourable result ». The functional signi- 
ficance of the phenomenon is implicit in its definition, but we are yet some 
way from understanding its mechanisms. 

CAVAGGIONI, GIANNELLI and SANTIBANEZ (1959) have shown that a cat falls 
asleep when it becomes habituated to a photic stimulus, at least if the animal 
is kept in a sound proof room and all other stimuli are avoided as far as pos- 
sible. Does the animal fall asleep because the main stimulation is gradually 
losing its functional significance, as a consequence of habituation, or is visual 
habituation the result of the drowsiness of the animal? 

The first hypothesis is disproved by an observation of SHARPLESS and 
JASPER (1956): cortical habituation may be absent, and indeed the primary 
cortical response may be increased, when the animal is beginning to fall asleep 
as a consequence of the monotonous repetitive stimulation. 

Let us take into consideration the second hypothesis. It is seemingly sup- 
ported by two important findings. 


1) DUMONT and DELL (1958) have shown that the potential evoked in 
the visual cortex by an electrical shock applied to the optic nerve may be 
greatly facilitated by stimulating the reticular formation, i.e. by experimental 
conditions reproducing the EEG arousal. An effect opposite in sign occurs 
during habituation, when EEG sleep patterns are present throughout the ce- 
rebral cortex. 


2) LINDSLEY (1957) has recently reported a second group of findings, 
which also relate the evoked response to the ascending reticular system. In 
the human subject two brief 10 microsecond flashes of light presented 150 ms 
apart are easily seen as two, and similarly at 100 ms separation. At 50 ms 
separation they are seen as one. In the cat and monkey the same two flashes 
of light at 150 and 100 ms separation will produce two distinct evoked po- 
tentials in the visual cortex, but at 50 ms separation only one such evoked 
potential pattern can be detected. If the reticular formation is stimulated 
electrically two distinct evoked potentials will be recorded even for intervals 
of 50 ms. Thereafter the response will revert to a single evoked potential as 
before reticular stimulation. Thus it is evident that stimulation of the ascending 
reticular system in some way facilitates the visual cortex, so that it can re- 
solve two brief flashes. The conclusion might be drawn that just the opposite 
effect occurs when the animal becomes drowsy and that this reduced efficiency 
of the visual system during drowsiness is the cause of visual habituation. 


The inadequacy of this explanation becomes apparent, however, on closer 
examination. 


1) First of all if habituation were the result of a general decrease in the 
activity of the alerting or attention mechanisms, then it should be observed also 
for other sensory stimulations. There is little doubt, however, about the strict 
specificity of habituation not only to the modality (say the sound) but also to 
the quality (the pitch) of the repeated stimulus (SHARPLESS and JASPER, 1956). 


2) The animal may fall asleep when habituation is not yet present in 
the sensory cortex (SHARPLESS and JASPER, 1955). 


3) Although lacking after deep anaesthesia or in the coma produced 
by interruption of the brain stem reticular formation, habituation to an audi- 
tory stimulation can be observed in the cat during spontaneous sleep (HER- 
NANDEZ-PEON, JOUVET and SCHERRER, 1957). 


Hence the interruption of the sensory inflow which characterizes habituation 
is not causally related with generalized phenomena such as sleep or wake- 
fulness, although it may be in some other way related to them. 

There is little doubt that habituation arises in the higher parts of the 
central nervous system. If the blockade occurred in the retina itself or even 
in the lateral geniculate body, this would amount to a discontinuation of the 
habituating stimulus. Since it is well known that the habituated response 
recovers when the stimulus is discontinued, the conclusion should be drawn 
that sensory volleys still impinge upon the central nervous system when the 
evoked potential is absent or reduced in the visual cortex. Hence it would 
be perhaps wiser to state that during habituation the response to the visual 
volleys is different, rather than to say that it is abolished. The crux of the 
matter is to see what this difference is and why it comes to light as a conse- 
quence of a monotonous repetitive stimulation. 

It would be dangerous to assume that the absence of an evoked potential 
means that no activity is elicited by sensory stimulation in a given nervous 
structure. It would be safer to state, say, that during habituation the cortical 
neurons of the habituated cortex react in such a way when they are impinged 
upon by the visual volleys that an evoked potential can no longer be obtained. 
JUNG, von BAUMGARTEN and BAUMGARTNER (1952) and JUNG and BAUM- 
GARTNER (1955) have clearly shown with their microelectrode studies that in 
the visual cortex there are several neurons whose response to visual messages 
is merely represented by an inhibition of their spontaneous discharge. This 
inhibition cannot be detected by leading from the cortex with macroelectrodes, 
as usually done in the electrophysiological studies on habituation. 

Several considerations had already led to the conclusion that some kind 
of inhibition is responsible for habituation: possibly a phenomenon akin to 
Pavlov’s internal inhibition. A peculiar type of inhibition, I hasten to add, 
since the same neurons will easily respond to the same sensory stimulation 
if other stimuli are applied simultaneously. This phenomenon, called dis- 
habituation, hints that the inhibitory process which is responsible for habi- 
tuation will fade away (or not appear) under the impact of other messages. 

We are thus coming to grips with the core of our problem, a hard core 
and one which has not yet been solved. I am now going to present a few theore- 
tical considerations, which may serve as a background for discussion and pos- 
sibly for future experimental approach to our problem. 

There would be no difficulty in thinking that sheer repetition of an iden- 
tical stimulus would gradually increase the number of those visual cortical 
neurons which respond with an inhibition of their spontaneous discharge to 
the incoming volleys. This hypothesis may be wrong, but it is one anyway 
which can be approached experimentally. We hope to test it with micro- 
electrode studies. If it is confirmed, the next step will be to investigate the 
mechanism of this gradual increase which is responsible for this type of inhi- 
bition. 

It is usually stated that a servomechanism is an automatic regulatory device 
actuated by the difference or «error » between a desired reference input and 
the actual value of output. In this way a constant input is maintained, a 
phenomenon that is exemplified by the usual pupil reflex to light. 
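
The error-actuated regulation just described can be sketched in a few lines. This is only an illustrative toy, not a model of the real pupil reflex: the function name `regulate`, the gain, the starting aperture and the numerical values are all assumptions introduced here. The light reaching the retina is taken as ambient light times aperture, and the aperture is corrected at each step in proportion to the error.

```python
def regulate(ambient_light, reference_flux, gain=0.5, steps=50, aperture=1.0):
    """Toy error-actuated regulator (hypothetical names and values).

    The regulated quantity is flux = ambient_light * aperture. At every step
    the aperture is corrected in proportion to the error between the desired
    reference flux and the actual flux.
    """
    history = []
    for _ in range(steps):
        flux = ambient_light * aperture
        history.append(flux)
        error = reference_flux - flux              # the "error" signal
        aperture += gain * error / ambient_light   # correction shrinks the error
    return aperture, history


# Assumed example: bright surroundings, low desired flux on the retina.
final_aperture, flux_history = regulate(ambient_light=10.0, reference_flux=2.0)
print(flux_history[:4])               # flux approaches the reference step by step
print(final_aperture * 10.0)          # close to the reference flux
```

With a gain between 0 and 2 the error in this sketch shrinks geometrically, so the regulated flux settles at the reference whatever the (constant) ambient level.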

We are confronted here with an entirely different phenomenon, since it 
is the lack of an error which apparently evokes the habituation mechanism, 
whose tendency is actually to obliterate a constant input. 

A reversal of the response of a single neuron, from excitation to inhibi- 
tion, depending upon the background activity, has been observed (von BAUM- 
GARTEN, MOLLICA and MORUZZI, 1954). The mechanism of the phenomenon 
is still unknown. The hypothesis might be advanced that any regular, mono- 
tonous repetition of an identical stimulus would bring about, by a pheno- 
menon of temporal facilitation, an avalanching increase of activity in the 
inhibitory neurons, which are probably present both in the specific projection 
system and in the reticular formation. In the visual cortex a gradual decrease 
of the evoked response would occur, while at reticular levels this effect would 
be responsible for the appearance of sleep. Thus habituation of the cortical 
sensory response and sleep induced during habituation would not be causally 
related, although both phenomena would be the consequence of a basic pro- 
perty of the central nervous system operating at different anatomical sites. 
I hasten to say that inhibitory responses are elicited in several reticular neu- 
rones also by a single stimulus (von BAUMGARTEN and MOLLICA, 1954). Our 
hypothesis would simply predict that these inhibitory effects would be strik- 
ingly increased by sheer repetition of the stimulation. 

What we know about habituation might be explained quite well with this 
hypothesis. 


1) Dishabituation elicited by other sensory stimulation would be due 
to inhibition of an inhibitory process, as PAVLOV had already surmised. Con- 
ditioning obtained by a nociceptive stimulus would act in a similar manner. 


2) General anaesthesia would depress more severely the inhibitory mecha- 
nisms, thus abolishing (or preventing) habituation. 


3) The fact that the animals become habituated more rapidly on the 
second and third days of recording than on the first, hints that a state of 
habituation persists from one day to the next. This obviously relates habi- 
tuation to learning, a similarity that had been already pointed out by THORPE 
(1950). 

Introduction. 


In the context of a Course on information theory, a psychologist feels 
strongly tempted to start his speech with a series of caveats. I will say only 
this, that in spite of the assurances that you have been given it is still not 
true that a man can be considered a priori as anything predictable. I would 
also say that all the assumptions made by mathematicians and engineers 
about men were certainly false, so that if I contradict anything said by anyone 
else up to this point, I am asserting categorically that they are wrong and I 
am right. 

It is perhaps almost a commonplace that a man can be regarded as a com- 
munication system. He receives sensory impressions (information) from his 
environment (a source). This information is transformed (recoded) initially 
in the sense organs and subsequently in successive centers of the nervous 
system. It is transmitted over nerve trunks from one station to another and 
amplified in the course of its transmission. It is stored both for short times 
necessary to effect sequential recoding and for long times in something that, 
remarkably enough, is called memory. Finally, the information is retransformed 
into environmental events that contain some fraction of the information that 
entered the sensory end of the channel. The system is noisy as any real system 
is bound to be, but it is amazingly economical, highly portable, easily pro- 
grammed and has been proved out in service in essentially the present model 
for at least 5000 years for which we have records. Quite a machine! 

At the same time, many people have had the reverse idea, that nature, 
including machines, often imitates the behavior of men. Primitive man ascribed 
human motives to the wind and storms, to the mountains and to nature about 
him. He saw devils and spirits in the inanimate world. The recent fascination 
with robots and «Giant Brains» has about it some of the same uncritical 
bemusement with physical systems. The building of man-like machines will 
not occupy me here because I feel that sweeping analogies are cheap and 
contribute rather little to understanding human behavior. 

Before discussing specific experimental results, it should be pointed out 
that any given physical unit may be regarded as any given part of a communi- 
cation system depending entirely upon what our purpose is. Thus a man is 
clearly a source of a message or a destination when the engineer is talking 
about a communication channel such as a telephone. But he may equally 
well be regarded as the channel between some physical source and behavioral 
end result. Or again his behavior may be the source for experimental and 
statistical procedures leading to some inference. 

At the present time it is most difficult to say very much about a man as 
a source. We know something about the statistical properties of some of the 
signals that he emits, such as the sequence of sounds of speech. Recently 
there have been hints that more can be done with this area, generally in the 
direction of statistical models, of which a MARKOV process is a simple example. 
I am referring to recent work of NOAM CHOMSKY and GEORGE MILLER. 
Unfortunately, I am not competent to discuss these developments. 

It would also be possible to describe a man as the receiver of signals and, 
in a sense, someone’s ability to receive information is almost always presumed 
as the goal of any system. Machines do not run for themselves but they are 
created to serve men. Once more this is a problem beyond our present abil- 
ities and I shall mention it later only in an oblique way. 

This brings me to the third possibility, and that is to regard man as a 
channel. Such was the general description with which I started. What I am 
going to do is to discuss the input end, that is, sensory and perceptual processes 
which encode the input. Then I shall turn briefly to the output end and return 
finally to the central part of the process. 


1. — Information in single perceptual displays. 


One of the most obvious questions to ask in terms of information theory 
is what is the capacity of our senses, the eye or ear, to transmit information? 
How does it compare with the channel-capacity of a telephone or television 
circuit? Before trying to answer this question let me make three points to 
clear away some ambiguities. 


1.1. Most discussions of channel-capacity, and particularly the relation 
of channel capacity to band width, are based on signals considered in real 
time. On the other hand, when we are dealing with speech, with symbolic 
and logical operations, and with other human activities, the natural scale is 
in terms of events. Speech consists of a sequence of phonemes. The information 
conveyed is invariant with number of phonemes rather than with real time 
which is a less meaningful variable. The two types of scale are not unrelated, of 
course, but in our first approach to many problems we shall use the scale that 
is most convenient, and shall leave for the future the reconciliation of data 
from the other. 


1.2. In discussing the question of channel-capacity we are indulging in 
an illegal activity. Specifically, we are asking certain empirical questions 
rather than formal questions. As it was originally introduced by Shannon, 
the capacity of a channel, C, is a quite formal concept. The empirical questions 
on the other hand, are interesting ones and I believe they are scientifically 
fruitful. I hope that it will remain perfectly clear that my use of the term 
channel-capacity does not carry with it any formal implications. 


1.3. In dealing with an informational analysis of human behavior, there 
is one very serious limitation that must be studiously ignored. This is the 
problem of sampling. Let me take speech as an example. For purposes of 
analysis we assert that certain sequences of words have a certain very small 
yet real probability of occurring. Actually, in any practical sample they never 
occur. The probabilities are too small. And it would take more than a man’s 
lifetime for him to utter enough sentences to constitute a good sample. On 
the other hand, some utterances do occur, and the fact that they occur may 
well be statistically highly improbable. To add the coup de grace to our other 
troubles, we note that people are not highly stable systems but they change 
as a consequence of experience. No man is ever twice just the same person. 
Thus we conclude that in fact no statistical model can ever be proven correct 
for all the phenomena with which it is supposed to deal. It is just one of these 
lovely fictions, a bit of the Als-Ob that makes scientific life worth living. 


a) One way of estimating the capacity of the human senses is to build 
upon the observation that the sense organs and the nervous system are discrete 
in certain important ways. First of all, the peripheral sense organs and the 
nerve trunks that serve them are made up of discrete cells capable at times 
of independent action. Furthermore, the action of most nerve cells is all-or-none; 
a discharge once started is propagated at maximum amplitude, and such a 
discharge is followed by a refractory period before another discharge can 
be initiated. Thus, it is easy to suggest that the nervous system is like a 
multiple-wire cable over which pulses are transmitted. Each pulse on each 
wire represents a point in a probability-space, and by the same token so does 
each discharge over each fiber. From these observations one could calculate 
a rate of transmitting information. I am persuaded that this picture is too 
simple but I shall not go into that now. 

Another well-known stepwise phenomenon is the existence of a threshold 
in sensory responses. One such threshold is the minimum strength of a stimulus 
that can be reported. Below the threshold there is no reportable effect; above 
the threshold, an observer sees a light or hears a sound. There is also a threshold 
for differences in any discriminable aspect of stimulation. Given a sound of 
any loudness, a second tone will be heard as equally loud until the intensity 
has been increased more than a threshold amount. These two thresholds suggest 
a kind of psychological scale with discrete steps for each sensory attribute. 

A very little consideration makes clear that this view is also far too simple. 
For one thing, the existence of the absolute or stimulus threshold says nothing 
about the magnitude of supraliminal sensations, and therefore nothing about 
how much information is contained in an experienced magnitude. It sets a 
kind of a boundary, and that is all. Moreover, there is an ancient paradox 
connected with the differential threshold. Suppose that you can just distin- 
guish between 100 grams of weight and 110 grams. What will happen if you 
now take 105 grams as your standard? Will the sudden step still come at 
110 grams? The experimental fact is that the noticeably heavier weight will 
be 115. Quite evidently there is a continuous variable that underlies the sal- 
tatory effect that we call the threshold. Another form of the paradox is that 
100 and 106 will be judged equal. So will 106 and 112, 112 and 118, and 
yet 100 is very different from 118. In logical terms, the equation is not transi- 
tive, for 100 and 124 are easily distinguished. In our present context we are 
interested in the reverse argument: it is not proper to argue from the fact 
of a threshold to the concept of a stepwise scale. The facts suggest that our 
variables are in some cases continuous (in computer terms, analog) transforms 
and that the threshold is a particular function (digital) superimposed on the 
underlying scale. 
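
The paradox of the differential threshold can be checked numerically. The sketch below assumes, purely for illustration, that two weights are told apart when the heavier exceeds the lighter by at least 10 per cent (a figure suggested by the 100 versus 110 gram example above; the function name and the exact criterion are hypothetical). It shows that the step moves with the standard, and that "judged equal" is not a transitive relation.

```python
def noticeably_different(a, b, weber_fraction=0.10):
    """True when the heavier weight exceeds the lighter by at least the
    assumed relative threshold (10 per cent here, an illustrative value)."""
    lighter, heavier = sorted((a, b))
    return (heavier - lighter) / lighter >= weber_fraction


# The step is not fixed at 110 grams: it depends on the standard chosen.
print(noticeably_different(100, 112))   # True
print(noticeably_different(105, 110))   # False

# Each neighbouring pair is judged equal, yet the two ends are clearly different.
chain = [100, 106, 112, 118]
neighbours_equal = all(
    not noticeably_different(x, y) for x, y in zip(chain, chain[1:])
)
ends_differ = noticeably_different(chain[0], chain[-1])
print(neighbours_equal, ends_differ)    # True True
```

The underlying variable (weight) is continuous throughout; only the yes-or-no judgment laid over it is stepwise, which is the point made in the text.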

In passing it should be mentioned that both the discreteness of nervous 
response and the threshold have been used as a basis for calculating the infor- 
mation received by the ear or by the eye. These estimates turn out to be very 
large; I shall not repeat them here because they are pretty obviously wrong. 


b) As an alternative procedure for measuring the capacity of the sensory 
channel we turn to more direct measurement of transmitted information. The 
procedure is straightforward. A set of stimuli, such as notes on the piano, 
are presented to a subject who identifies the notes presented in accordance 
with a pre-arranged code. As the size of the set of stimuli is increased a point 
is reached where there are a significant number of confusions. From the 
resulting input-output matrix (or confusion matrix) it is a straightforward 
matter to compute transmitted information. 
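
As a minimal sketch of that computation, the function below takes a confusion matrix of counts (rows for stimuli, columns for responses) and returns the transmitted information in bits, estimated in the usual way from the relative frequencies. The function name and the example matrices are hypothetical and are not data from any of the experiments discussed here.

```python
import math


def transmitted_information(confusion):
    """Transmitted information in bits from a matrix of counts.

    confusion[i][j] is the number of times stimulus i drew response j.
    The estimate is sum p(i,j) * log2( p(i,j) / (p(i) * p(j)) ).
    """
    total = sum(sum(row) for row in confusion)
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(col) for col in zip(*confusion)]
    bits = 0.0
    for i, row in enumerate(confusion):
        for j, count in enumerate(row):
            if count == 0:
                continue  # empty cells contribute nothing
            ratio = count * total / (row_sums[i] * col_sums[j])
            bits += (count / total) * math.log2(ratio)
    return bits


# Four stimuli identified without error: all 2 bits get through.
perfect = [[10, 0, 0, 0], [0, 10, 0, 0], [0, 0, 10, 0], [0, 0, 0, 10]]
print(transmitted_information(perfect))   # 2.0

# Invented matrix with confusions between neighbouring stimuli.
confused = [[8, 2, 0, 0], [2, 6, 2, 0], [0, 2, 6, 2], [0, 0, 2, 8]]
print(transmitted_information(confused))  # less than 2 bits
```

With small samples this estimate is biased upward, so in practice the number of presentations per stimulus has to be fairly large before the figure can be trusted.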

Now for some experimental results. POLLACK (1952) has reported results 
for pitch. From 2 to 14 different pitches were presented with results shown 
in Fig. 1. Up to 4 pitches were identified perfectly; beyond that number, 
errors increased rapidly and the curve becomes asymptotic to a limit at about 
2.3 bits per stimulus. Changes were made in the range of frequencies employed, 
in the distribution of stimuli within the range, and in the loudness of the tones. 
These, and doubtless many other possible 
variations, do not increase significantly 
the information transmitted. 

POLLACK (1953), POLLACK and FICKS (1954), and GARNER (1953) have 
done the same experiments with loudness. Tones of various intensities are 
presented and the listener has to label them accurately. Fig. 2 shows GARNER’s 
results after partialling out disturbing intraserial effects. Once more, the 
curve rises with a slope of 1.0 to just over 2.0 bits per stimulus and then 
levels out with 2.3 bits as a limit. 

Fig. 1. - Information transmitted by presenting one of a set of tones 
(POLLACK, 1952). 

The saltiness of salt was tested in the same way by BEEBE-CENTER, ROGERS 
and O’CONNELL (1955) with results shown in Fig. 3. The tongue is less acute, 
according to their results, than is the ear. The maximum lies at about 1.9 bits 
per stimulus representing an accurate 
discrimination of about four levels of 
saltiness. But while the decrease from 2.3 
to 1.9 bits is significant, it is trivial in 
contrast with the broad fact that infor- 
mation is almost invariant when stimulus 
magnitudes are chosen from widely dif- 
ferent ranges. GARNER’s tones covered a 
range of 95 db while the salt solutions 
varied in concentration by a ratio of 1 
to 100. (We might call this 20 db). 

The eye does slightly better, in some 
respects. HAKE and GARNER (1951) report 
results from an experiment in which sub- 
jects estimated the position of a pointer between two marks on a scale. The 
stimulus positions that were utilized divided the interval into 5 parts, into 
10 parts, into 20 parts and into 50 parts. The results are shown in Fig. 4, in 
which the open circles indicate results when the subject knew how many 
positions were being employed, and the filled circles judgments without 
knowledge of the stimulus positions. The asymptotic value of 3.25 bits per 
presentation is about one bit higher than the values for pitch and loudness. 
It is only fair to point out that in this case there is one important difference 
from the previous experiments. This is the simultaneous presence of anchoring 
stimuli at both ends of the scale. POLLACK found that a single anchor in the 
case of pitch judgments added roughly 0.5 bits. There is the further fact 
that one of the stimulus positions in HAKE and GARNER’s experiment was 
directly over the scale marks, allowing for no uncertainty in the report. These 
two factors may account for as much as 0.7 bits. Subtracting this, the estimate 
is once more in the same range. 

Fig. 2. - Information transmitted by presenting tones of varying intensity 
(GARNER, 1953). 

The single aspect of size did less well in an experiment of ERIKSEN and 
HAKE (1955) using squares of different sizes. Since the shape was constant 
this presumably is equal to a judgment of linear extent. Their highest values 
were 2.2 bits, comparable with the corrected estimate of 2.5 for a point on 
a line. POLLACK (see MILLER, 1956) has reported values of 2.6 and 2.7 for 
area, and 2.6 to 3.0 for length of line. 

While the experimental work is far from being exhaustive, the sampling 
of stimulus dimensions has been reasonably broad. The invariant relation 
that emerges is rather striking for a psychological function. Whenever infor- 
mation is conveyed by systematic variation of a single psychological attribute 
there is a limit to the number of useful steps that may be employed, and this 
limit lies between four and eight steps, i.e., two to three bits of information, 
over a wide range of conditions. Once the steps are clearly above the dif- 
ferential threshold, neither the range nor location of the stimuli seem to be 
of great consequence, nor is the time allowed for judgment, nor the form of the 
report. The only added variable of significance is the use of explicit standards, 
and in this instance it appears that not the so-called absolute but rather a 
differential judgement is added in. In some very basic aspects the use of in- 
formation from absolute judgements is a very different process from the dis- 
crimination of a difference. 
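
The correspondence between bits and the number of steps used without error is simply a base-two logarithm, and a short sketch makes the figures quoted above easier to compare. The function names are hypothetical; the bit values are the asymptotes reported in the text for pitch and loudness, for saltiness, and for the pointer position.

```python
import math


def bits_for_categories(n_categories):
    """Information, in bits, carried by an errorless choice among n equally likely steps."""
    return math.log2(n_categories)


def equivalent_categories(bits):
    """Number of perfectly identified, equally likely steps equivalent to a given number of bits."""
    return 2 ** bits


# Four to eight useful steps correspond to two to three bits.
print(bits_for_categories(4), bits_for_categories(8))

# Asymptotes quoted in the text, converted to an equivalent number of steps.
for label, bits in [("pitch, loudness", 2.3), ("saltiness", 1.9), ("pointer position", 3.25)]:
    print(label, round(equivalent_categories(bits), 1))
```

Read this way, 2.3 bits amounts to about five steps, 1.9 bits to a little under four, and 3.25 bits to between nine and ten.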

c) In what has been said up to this point about channel capacity there 
has been an important basic limitation, namely, that we are sending a single 
message by the choice of a value from a single dimension. The choice turns 
out to be from a rather small set of alternatives, let us say, five or six. But 
a man is not limited in this way. Objects and events vary in many more ways 
than a single attribute and presumably a response can be made at one and the 
same time to several of these attributes. Now the question is, are these attri- 
butes independent of each other, or, does the information in each add to the 
information in the others? The answer is that additivity is sharply limited. 

Once more I should like to interpolate for this audience the remark that 
this is an old question in psychology. It is the problem of attention. It has 
long been known by psychologists that the set (attitude, Aufgabe, Einstellung, 
determining tendency) can be a major determinant of the report that an 
observer gives, and, particularly, that a person set to observe one aspect of 
an object exposed for a brief moment will be less able to report some other 
aspect that might be clear if the set were changed. This marked selectivity 
of the human and, we believe, animal observer must be familiar to all of you. 
What is disturbing is that this filtering, this pre-tuning of the receiver, changes 
with great rapidity and we mistake the sum of a series of successive impressions 
for a single simultaneous one. 

Please do not think that the problem is solved by these remarks. It is 
in fact only made more difficult. While it may be true that the demands made 
on the input channel are found to be less severe by virtue of the selectivity 
of attending to one thing at a time, there are posed for the organism two other 
problems that are quite as difficult. These are, first, the problem of what 
determines the set, and the rapid changes in set, in normal observing, and, 
second, how is all this successive information stored and organized so that 
a man can act as if he had a very large channel capacity at any one moment. 

Having argued against the complete additivity of various dimensions, let us 
look at some experimental results. What does happen to channel capacity as the 
number of dimensions in the stimulus is increased? In a study by KLEMMER and 
FRICK (1953) the issue is put quite clearly. They used a pattern with a dot, 
shown on a screen, and asked subjects to mark down on a paper with squares 
where the dot was located. Actual grid lines on the target and on the record 
sheet made no difference in the results. The fact that the position of the dot 
varied along two coordinates made quite a big difference. The results are shown 
in Fig. 5. As the number of possible positions on the grid increased, the 
information in the source increased, and the information transmitted went right 
along with it until a fairly sharp limit was reached, at about 4.5 bits. This is 
to be compared with 2.5 bits for a single dimension. 

Fig. 5. - Information in position of a point in a [...] 

KLEMMER and FRICK carry the case a step farther. Suppose there are 
two dots in the square, then there are four coordinates. And indeed they find that 
the limit rises again. Three dots can be thought of as six coordinates, and so 
on. The actual tests were limited to four dots in a 3×3 matrix because the total 
set becomes so large as to make any exhaustive sampling impossible. But 
the point that is clear is that the losses in transmission are quite small. In 
their most complex case, from 1 to 4 dots in a 3×3 grid, they were able to 
recover 7.8 out of 8.0 bits of information. KLEMMER (1957) has recently 
extended this study to a small sample of the possible patterns on a 4×5 matrix. 
For his best subject he reports values between 11 bits and 13 bits. There are 
very large differences among individuals in a short test and it seems probable 
that the training of the subject in reporting becomes very important. Never- 
theless, we are obviously moving to a fairly high level compared with one- 
dimensional tests. 

I should like to comment briefly at this point on the ambiguity of the term 
attribute, or aspect, or coordinate. In the KLEMMER and FRICK study it is pos- 
sible to think of four dots each having two coordinates and each coordinate hav- 
ing three possible values. Taking the logarithm to base two of 3 as about 1.6, 
the stimulus information is then 4 × 2 × 1.6 = 12.8 bits. 
But it becomes clear that this estimate is too high. Certain instances are 
excluded because they are identical. Obviously, it is more meaningful to 
map the whole thing into a nine-dimensional space, one for each of the cells, 
and think of each dimension having possible values of 0 and 1. Thus there 
are 2^9 possible patterns, and nine bits of information. I recite this elementary 
instance because it is typical of a very large class of problems. The issue is 
what particular language is useful in describing the situation to which a person 
responds. Depending upon one’s choice in this matter, man can look quite 
ineffective or quite effective. Or [...] multi-dimensional studies. Several have 
been done, and none of them are completely satisfactory. One trouble is that any 
systematic testing of displays with, let me say, 30 bits of information is a 
rough job. The thirtieth power of two may be a small number to people who are 
dealing with orders of infinity, but when an experimenter tries to sample 
exhaustively such a set, 2^30 looks enormous. 

Fig. 6. - Summary of information transmitted by multidimensional displays 
(MILLER, 1956). 
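
The two ways of counting the dot patterns, and the size of a 30-bit set, can be checked with a few lines of arithmetic. This is only a sketch under the assumption that all patterns are equally likely; the variable names are introduced here for illustration.

```python
import math

# Description by coordinates: four dots, two coordinates each, three values per coordinate.
coordinate_estimate = 4 * 2 * math.log2(3)

# Description by cells: nine cells, each either empty or occupied.
cell_patterns = 2 ** 9
cell_estimate = math.log2(cell_patterns)

# Patterns actually used in the most complex case: from 1 to 4 dots in the 3x3 grid.
patterns_one_to_four_dots = sum(math.comb(9, k) for k in range(1, 5))
bits_one_to_four_dots = math.log2(patterns_one_to_four_dots)

# Size of the set an experimenter would have to sample for a 30-bit display.
exhaustive_30_bit_set = 2 ** 30

print(round(coordinate_estimate, 2))        # about 12.7 bits, the overestimate
print(cell_patterns, cell_estimate)         # 512 patterns, 9.0 bits
print(patterns_one_to_four_dots, round(bits_one_to_four_dots, 2))  # 255 patterns, about 8 bits
print(exhaustive_30_bit_set)                # 1073741824
```

The count of 255 patterns for one to four dots agrees with the figure of 8.0 bits of stimulus information quoted for that case, while the coordinate description overstates it because it counts the same pattern several times.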

A few results are summarized in Fig. 6. The point to be made here is that 
additional dimensions do add more information, but the return falls off as 
the number of dimensions goes up. 


d) There are several lines of indirect evidence to suggest that there is 
an upper limit for the channel capacity, no matter how many dimensions are 
chosen, at 15 to 20 bits per presentation. There is, for example, a good deal 
of rather old work on the so-called span of apprehension. A set of letters or 
numbers is exposed in a tachistoscope for a brief period of time, perhaps 1/50 s. 
The subject reports what he has seen. [...] represent an average man’s limit, 
which adds up to perhaps 20 bits. It is also known that the span increases as 
the choice of material is limited to words. Clearly the number of items goes up 
as information per item comes down. 

Fig. 7. - Top: appearance of the stimulus field of four light boxes, each 
containing seven individually lighted line segments. In the typical random 
patterns of lights shown, the black area represents the lighted lines. Bottom: 
patterns used to represent numerals. 

Another line of evidence can be derived from eye-movement studies of 
reading. A good reader makes as few fixations as he can without retracing 
more than occasionally. In one recent study using college students, an average 
of eight fixations for a line of 60 letters were found (CARMICHAEL and 
DEARBORN). This works out, at 2.0 bits per letter in meaningful text, to 15 bits 
of information per fixation. 

I would like to mention finally a quite recent study of KLEMMER and
LOFTUS (1958). They used a stimulus display shown at the top of Fig. 7. In
their main series of tests they used a random selection among the 128 possible
patterns at each of the four positions. These included patterns that might be
interpreted as numerals or letters. The forms of the numerals are shown at the
bottom of Fig. 7. One obvious result of the experiment was that the subjects
had trouble in writing down what they saw. Usually 5 or 6 s were required
and many of the figures had been forgotten by the time the subject got
to them. KLEMMER and LOFTUS resorted to an important technique, namely telling
the subject after he had seen the set of figures what part to report. This was
done, for example, by a small red light under the figure. The results are given
in Fig. 8. They also found very substantial practice effects and some preference
for continuous or closed figures. Under the best conditions (partial report with
a prompt cue and some three weeks of daily practice) the best subject got up
to 26 bits, the average subject up to 22 bits of information perceived. The best
measure of transmitted information, that is, instances actually reported totally,
was about 18 bits. Interestingly enough, numerals and letters did not show any
superiority over similarly shaped figures. Familiarity helps, if at all, in
remembering the material long enough to permit reproduction. But this is
another question.

Fig. 8. - Percentage of elements correct as a function of the time between stimulus
and poststimulus cue. Performance on test requiring complete response is shown for
comparison. Data average of ten Ss.
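
As a rough check on the figures quoted for the KLEMMER and LOFTUS display, the stimulus information can be computed from the description of the apparatus (four boxes, seven line segments each). The sketch below assumes every segment pattern is equally likely; the dictionary labels are added here for illustration only.

```python
import math

segments_per_box = 7
boxes = 4

patterns_per_box = 2 ** segments_per_box               # 128 patterns per position
stimulus_bits = boxes * math.log2(patterns_per_box)    # 4 x 7 = 28 bits in the display

# Figures quoted in the text, compared with the information available.
reported_bits = {
    "best subject, perceived": 26,
    "average subject, perceived": 22,
    "transmitted, complete report": 18,
}
for label, bits in reported_bits.items():
    print(f"{label}: {bits} of {stimulus_bits:.0f} bits ({bits / stimulus_bits:.0%})")
```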


To recapitulate very briefly, we found that the information conveyed by
an item presented along a single dimension was quite sharply limited to about
2.5 bits and that this amount was invariant over a rather wide range of
conditions, sufficient to suggest that we are dealing here with a basic if
ill-defined psychological mechanism. We have also shown that the information
increases, although not quite proportionately, as the number of dimensions is
increased. Just what constitutes a dimension or coordinate is left unclear.
But even as the number of dimensions is increased there is still a limit on
the perceptual system. This limit is probably between 15 and 25 bits per
presentation.


2. - Perceptual capacity in real time. 


Let me turn briefly from this kind of analysis in terms of short but
static presentations to the problem in real time. Can we translate from estimates
of information per event, or per presentation, to a continuous flow? The most
immediate observation to be made is that the limit on the « channel » may
occur at quite different points. Thus KLEMMER and LOFTUS were able to get
into their subjects some 25 bits with an exposure of 1/50 s. Actually the time
might have been 1/10000 s with an extremely bright light. Obviously such
very short times cannot be converted into a rate such that 50 or 10000 fields
could be usefully presented in a second. For one thing, response is seriously
limited. Even to get evidence of 25 bits, a procedure had to be adopted that
called for only a 6-bit output. Furthermore, the sensory mechanism itself
is limited. We know from studies of flicker fusion that randomly selected
fields at 50 per s would fuse into a uniform grey and nothing would be seen.
Once more it is important that we avoid too hasty inferences and turn rather
to a direct measure of a rate in real time.

I do not want to stop here to review studies made on what I might call 
the «straight-through » rates, that is, rates in which the input information is 
reproduced in substantially the same form in the output. These include typing, 
reading aloud, sending and receiving telegraph, piano playing, taking dictation 
in short-hand, etc. Let me just say that all of these rates may well be limited 
on the motor side as well as the perceptual side, as has been pointed out by
QUASTLER. For the present I would like to limit my consideration to problems
of sensory or perceptual inputs in real time.

Of the various kinds of evidence, the speed of silent reading gives us the 
highest rates on which we have any reasonable evidence. Successive fixa- 
tions of the eyes occur at an average rate of about four per second. If 15 bits 
are perceived in one fixation, then we presumably receive information at a 
rate of 60 bits/s. Quite obviously most of this information is discarded almost 
in the moment it is read, for no one has the ability to recall or to recode infor- 
mation at this rate. 
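
The reading estimate is a short chain of multiplications, written out below with the figures given in the text (60 letters per line, eight fixations per line, 2.0 bits per letter, four fixations per second). The variable names are illustrative assumptions, not the author's notation.

```python
letters_per_line = 60
fixations_per_line = 8
bits_per_letter = 2.0        # estimate for meaningful text, as quoted in the text
fixations_per_second = 4

letters_per_fixation = letters_per_line / fixations_per_line   # 7.5 letters
bits_per_fixation = letters_per_fixation * bits_per_letter     # 15 bits
bits_per_second = bits_per_fixation * fixations_per_second     # 60 bits/s

print(f"{letters_per_fixation} letters and {bits_per_fixation} bits per fixation")
print(f"{bits_per_second} bits/s in silent reading")
```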

The other empirical approach to a rate of discrimination in real time is 
through auditory, and particularly verbal, material. The estimates of informa- 
tion are less clear cut than in the case of reading because there is somewhat 
less agreement on the proper segmentation of a flow of speech sounds. For 
example, a fine phonetic analysis carries the implication that the listener 
hears much finer discriminations than he does if a loose phonetic analysis is 
made. In addition, there are clearly nuances of meaning and of emotional 
expression that are carried in the auditory pattern but are missing from the 
written material. 

The problem of segmentation is closely related to the problem of the choice
of a level of analysis. Should it be in phonemes, or morphemes, or words, or
sentences? These are areas into which I do not care to venture, especially as
these matters are discussed elsewhere in this Course. What is important
for the present purpose is to assert that as yet there are no radical differences
in the estimates of information rate depending upon the level of analysis
employed, nor is there any great difference between auditory and
visual.

Let me put the matter in this way. Presumably a trained listener can
understand speech sounds somewhat faster than the maximum rate at which
they can be spoken (an exception might be a speaker highly trained with
redundant material and a listener able to discriminate but not in possession
of the full code). The best guess is certainly that the ear can hear between
30 and 50 bits per second.

To sum up on this question of our discriminative capacity in real time, I
should like to make the following points:


1) In any given example there may exist a limiting rate either in the 
perceptual input, in the processing and recoding of the information, or in the 
motor output. Presumably the lowest of these establishes the overall rate. 


2) Except for certain straight-through rates the commonest limitation is 
on storage and output. No person at the end of a minute is ready to do any 
one of a billion ($10^{9}$) equally probable different things. Yet his senses may
have provided him with enough information to make such a choice. 


3) Serious as the mis-match between input and output may be within 
the organism, it is even greater between the outside world and the organism. 
The extreme example is, of course, a television channel. I shall want to say 
more later about this discrepancy. 


4) The actual use of a channel in every day life is usually at a fraction 
of its maximum rate. For example, aircraft pilots receive information at a 
rate of perhaps 2 bits/s over their radio telephone channel. The principal 
advantage of these low rates is the protection they afford against noise and error. 
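
The mismatch described in point 2) can be made concrete. A choice among a billion equally probable alternatives amounts to about 30 bits, whereas an input rate like the silent-reading estimate of 60 bits/s delivers far more than that in a minute. The sketch below is illustrative only: using the reading rate as the input figure is an assumption made here, not a calculation from the text.

```python
import math

alternatives = 10 ** 9
output_bits_per_minute = math.log2(alternatives)      # about 29.9 bits
output_bits_per_second = output_bits_per_minute / 60  # about 0.5 bits/s

input_bits_per_second = 60                            # silent-reading estimate (assumed input rate)
input_bits_per_minute = input_bits_per_second * 60    # 3600 bits

ratio = input_bits_per_minute / output_bits_per_minute
print(f"output: {output_bits_per_minute:.1f} bits per minute")
print(f"input:  {input_bits_per_minute} bits per minute, about {ratio:.0f} times the output")
```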


3. — Channel capacity in motor output. 


Somewhat further removed from information theory is the study of man's
output, particularly in terms of the work that he does. These matters have
been of interest both to psychologists and physiologists for many years.
Thus we know a good deal about how strong people are, about the force and 
energy with which various movements can be carried out. There are detailed 
studies of muscular fatigue. There are studies of the precision of movements, 
various kinds of aiming tasks, and of rates of movement, such as tapping and 
various measures of reaction time. But almost none of these studies touch 
on problems that can be put into informational terms. 

The simple fact is, of course, that the behavioral repertoire of most animals 
is amazingly limited so that the selection among various movements that an 
animal can make carries rather little information. Man is fairly far out on 
one end of the distribution in two important respects. First, he appears to have
somewhat greater differentiation of his vocal apparatus than most animals
(although perhaps not some birds). Second, he has independence in the movement
of his thumbs and to a lesser extent of the other fingers, so that he has a fairly
high degree of manipulative skill in his hands. But these are differences in
degree and at best they are not great.


What becomes obvious from an inspection of the gross facts is that the 
tremendous difference in informational output between a cow or a hen on one 
side and a man on the other depends relatively less upon the size of the set 
of possible movements but rather on timing and the sequential patterning 
of behavior. Let me take two simple examples to make my point. A telegraph 
operator makes only one or two basic types of movements involving perhaps 
a dozen muscles. It is no more complex, in fact probably less complex, than 
a hen pecking a grain. But the latter movement is utterly stereotyped and 
the former transmits the contents of a newspaper. Again, studies have shown 
that the corrective movements made to the «stick» in flying an airplane, 
or to the wheel of a car, are not finely graded, exact movements. The precision
required in flying a plane from New York to Paris is achieved by making a 
series of 2-bit responses at the right time and in the right sequence. 

After this rather long introduction, let me describe briefly a few examples 
of studies that have used informational estimates of behavior. The first have 
to do with reaction time. This is the time required for a subject to react to 
any specified signal. Characteristically he is given detailed instructions before 
the signal so that he is «set» in a particular way. Most often reaction time is 
measured in a series of single tests, but it is possible to arrange the tests in 
such a way that each response triggers the next signal and one measures a 
serial reaction time. The results are not greatly different. 

Reaction time has been known to be fastest to a single signal. It is
substantially slower if the subject has to press one of two keys, or to respond or
not respond, depending on the signal presented. Such choice or conditioned
reactions can be among several alternatives. Starting from this point
HICK (1952) showed that it was possible to account for reaction times on the
assumption that information was being transmitted at a constant rate. He
had to make one strong assumption that has not been readily accepted. This
is the assumption that the simple reaction time represents a 1-bit, not a zero-bit
choice, and that generally the information is calculated as log (n+1), where n is
the number of actual alternatives in the disjunctive reaction. HICK found
the fit of his data to this formulation rather good. Other experimenters have
not agreed. The alternative view is to suppose that there is a minimum
transmission time that has nothing to do with the processing of information, and
that there is added to this a processing time that is proportional to the
information required to make the choice. Fig. 9, taken from a study by HYMAN (1953),
strongly supports this second view. HYMAN used a variety of techniques for
changing the « information » in the stimulus input and found a relatively
invariant relationship of information to reaction time.
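
The two views of choice reaction time can be written as two small functions and compared side by side. This is a sketch under stated assumptions: the logarithm is taken to base 2 so that information is in bits, and the constants `SIMPLE_RT` and `SECONDS_PER_BIT` are hypothetical values chosen for illustration, not data from HICK or HYMAN.

```python
import math

def hick_rt(n_alternatives, seconds_per_bit):
    """Constant-rate view: the simple reaction counts as a 1-bit choice, log2(n + 1)."""
    return seconds_per_bit * math.log2(n_alternatives + 1)

def additive_rt(n_alternatives, base_seconds, seconds_per_bit):
    """Alternative view: fixed transmission time plus time proportional to log2(n)."""
    return base_seconds + seconds_per_bit * math.log2(n_alternatives)

SIMPLE_RT = 0.20         # hypothetical simple reaction time, in seconds
SECONDS_PER_BIT = 0.15   # hypothetical processing time per bit, in seconds

print(" n   constant-rate   additive")
for n in (1, 2, 4, 8):
    print(f"{n:2d}   {hick_rt(n, SIMPLE_RT):.3f} s         "
          f"{additive_rt(n, SIMPLE_RT, SECONDS_PER_BIT):.3f} s")
```

Both functions give the same simple reaction time for n = 1 with these constants. They differ in how the time grows with the number of alternatives, and that difference is what the experiments are meant to settle.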
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we, 


Not all such studies have been so successful. QUASTLER and BRABB (1956)
measured reaction time on a typewriter using trained typists. For the full
alphabet the reaction time was .53 s. For an 8-letter alphabet it dropped only
to .52 s; for four letters to .49 s; and for a single known letter to .21 s.
This is very far from either HICK's or HYMAN's result. As QUASTLER and BRABB
suggest, the short-term instruction is probably not effective in reducing the
complexity of the task in the face of a history of many hours of practice with
the full alphabet.

Fig. 9. - Reaction time (expressed in .001 s) as a function of stimulus
information (expressed in bits) when amount of information is varied in three
different ways (○ Exp. I, □ Exp. II, △ Exp. III). Data are for Ss G.C., F.K.,
F.P. and L.S., respectively.

KLEMMER and MULLER (1953) report results that look much like those of
QUASTLER and BRABB. Their task was not a simple disjunctive reaction but rather
was one that required that a pattern of fingers be pressed simultaneously.
Moreover their tests, with one exception, were paced. This makes it impossible
to compare the results directly. Their data suggest, however, that with only
one stimulus (0.0 bits) the optimum spacing is about .25 s while for the 2.0,
3.0, 4.0 and 5.0 bit tasks the reaction time was about .40 s. The slope of
HICK's functions is missing. It is not immediately evident just why the two so
nearly parallel experiments give different results. Perhaps the use of five
keys simultaneously instead of the single disjunctive reaction is a parallel
to the dimensionality of a display. That is, KLEMMER and MULLER have added not
only information but also more dimensions to their task.
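
One way to see how far the QUASTLER and BRABB figures depart from a constant time per bit is to compute the added reaction time per added bit between successive alphabet sizes. The sketch below assumes equally likely letters and a 26-letter full alphabet; both are assumptions made here for illustration, since the text does not give the exact information values.

```python
import math

# (alphabet size, reaction time in seconds), as quoted in the text.
conditions = [(1, 0.21), (4, 0.49), (8, 0.52), (26, 0.53)]

rows = [(n, math.log2(n), rt) for n, rt in conditions]

# Added seconds per added bit between successive conditions.
slopes = []
for (n0, bits0, rt0), (n1, bits1, rt1) in zip(rows, rows[1:]):
    slope = (rt1 - rt0) / (bits1 - bits0)
    slopes.append(slope)
    print(f"{n0:2d} -> {n1:2d} letters: {slope:.3f} s per added bit")
```

A constant transmission rate would give roughly equal values on every line. Here almost the whole increase occurs between one letter and four.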
High output rates are found only in highly-skilled tasks. Presumably they
give us a pretty good estimate of top human capacity. I would mention these
very briefly. One is the rate of talking or, more properly, of reading.
With optimum vocabulary, rates of 24 bits per second have been obtained.
With limited or highly redundant vocabularies the informational rates fall off
because of mechanical limitations in the production of sounds. Much the same
thing can be said about typing and piano playing. There is a psychological
limit on the speed with which the fingers can be moved. With a low
informational output per movement, the limit is the mechanical one. With more
complex material and more complex coordinations, information per movement
goes up. Total information also goes up until the subject begins to make errors
unless he slows down. Somewhere in this region there is characteristically a
plateau, with a rate of movement not far below the maximum possible, peak
information in the neighborhood of 25 bits per second, and the information
per element of the motor performance up to perhaps 5 or 6 bits, but certainly
not very high.

It is still an open question whether performances of this kind could be 
achieved with quite different motor systems if the subject had years of prac- 
tice. There is a fair chance that in the course of social evolution we have evolved 
techniques of communication that make optimal use of the human organism. 
But there are just enough places where we are sure that social evolution is 
not maximally adaptive so that we should be reasonably skeptical of any 
very dogmatic statements that these performances are the very best that any 
man will ever do. 


4. — Memory. 


From all that has been said up to this point it must be clear that if 
one looks at input and output rates alone, a man does not turn in a very 
impressive performance. Surely there are capacities of the human organism 
that are not tapped in these terms alone. Where is the trouble? Do we
overestimate ourselves? Are we really just a somewhat overworked 25-bit channel?
Or are the facts that I have supplied you grossly in error? 

First of all I would say that I am strongly convinced the facts are correct. 
Admittedly this is a judgment but by now enough people have been over this 
ground to make the main points pretty clear. Well then, is a man such a poor 
machine? No, I think not for two reasons. First of all he is not one but a 
very large set of 25-bit channels. As I have remarked earlier, the rate at which 
a man reprograms himself appears to be high. Let me just repeat that in many 
ways this problem of set is almost the key problem with which we have to 
deal. Let me remark that I believe I am pointing to the same problem that 
W. ROSENBLITH is going to describe as a «state» variable. The second reason 
man does so well is because he makes use of a substantial memory. It is to
this problem I should like to turn.

One very simple requirement of the human system for storage is a
consequence of the redundancy of our inputs and outputs. The signals furnished
to us by nature are highly redundant and the signals that we generate are also
redundant. Much of this redundancy is of almost no consequence at all. Thus
it is trivial that a visual signal has any particular coherence within a time
period of less than 1/100 s. The photoreceptor processes are such that they 
respond to an average of the radiant energy over a period at least this long. 
There are many other instances of the same kind in which the possible inputs 
simply fall outside the range to which a person could possibly respond. But 
there can equally well be coherence within the range of possible response but 
where the responder makes no use of the redundancy. For one thing, there 
is often no stress. Most of us walk about the streets or our homes making use 
of visual cues under circumstances where we could probably do almost as well 
if we were blind. At least in terms of the tests that we perform, there is little 
or no evidence that we make use of the strongly habitual character of many acts. 

Speech communication is perhaps one of the bright spots. Certainly many 
of us try to listen under stress. Conversation at a cocktail party or on a subway 
train is marginal. I am sure you would all testify to that. And were it not 
that cocktail party conversation with a pretty girl requires only about a 
two-bit answer, it would be totally impossible. More seriously, the evidence 
is that we depend very critically upon the redundancy of speech and if we 
are to make use of this redundancy we need a good storage system. G. MILLER
(1956) has remarked, and I agree most heartily, that it is the immediate
context that is of paramount importance. In the case of speech most of the
constraints are contained in the 10 to 15 adjacent letters. But this means 
that there must be a running storage of perhaps 3 to 6 words held in somewhat 
raw form that can be reinterpreted on the basis of the total pattern. 

Actually there is rather little detailed evidence on this kind of short-term 
memory and just how it functions. Two findings have a possible bearing, 
however. These are the immediate memory span and the eye-hand span. To 
start with let us look at the immediate memory span. If a series of numerals 
are spoken, then a subject can repeat immediately a span of from seven to 
ten digits. This test is a very familiar one because it has been used for years
as part of the Binet tests of intelligence. The span is highly correlated with
the age of children.

What happens when we use letters or words instead of digits? Is the
informational content of the stored material in any way constant? It is found that
the span falls off slightly as the material is drawn from a larger set. In an
experiment by HAYES (1952; see also MILLER 1956) the span decreases from
about nine digits to about five words from a 1000-word vocabulary. These
results are rather far from being invariant when converted into informational
terms. Nine digits are 28 bits and five words represent at least 50 bits. Other
results such as MILLER and SELFRIDGE (1950) on the one end and SMITH
(see MILLER 1956) on the other end, suggest that this function is rather flat,
but I am inclined to be quite skeptical of both results. For one thing, the
method used by SELFRIDGE to construct material leaves the informational
value of her words in doubt. At the same time, the fact that SMITH (and HAYES)
used limited sets of numbers is relatively artificial. I am doubtful that telling
a subject he now is dealing with 1 and 0 is sufficient to change his set from
the usual set of decimal numbers.
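
The conversion of memory spans into bits is a one-line calculation when every item is assumed to be drawn with equal probability from its set. The helper `span_bits` below is a hypothetical name introduced for illustration. Under this assumption nine digits come to about 30 bits, slightly above the 28 bits quoted in the text, and five words from a 1000-word vocabulary to about 50 bits.

```python
import math

def span_bits(items_recalled, set_size):
    """Information in a span of items drawn with equal probability from set_size alternatives."""
    return items_recalled * math.log2(set_size)

digit_bits = span_bits(9, 10)     # nine decimal digits
word_bits = span_bits(5, 1000)    # five words from a 1000-word vocabulary

print(f"nine digits: {digit_bits:.1f} bits")
print(f"five words:  {word_bits:.1f} bits")
```

On either figure for the digits, the span in items shrinks while the span in bits grows, which is why the results are far from invariant in informational terms.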


There is the further difficulty that this type of experiment is in a way 
self-defeating. To the extent that numbers contain less information, more of 
them must be retained and this requires a longer time. But if the process 
under investigation has some progressive decay function then the longer list 
will be less well retained. There is the very considerable further problem 
that the span is measured in isolation and the process of using the context 
in understanding information through a noisy channel is a relatively conti- 
nuous one. 

And yet — in spite of all these difficulties and limitations — it is still true 
that the order of magnitude of the context necessary in order to utilize the 
redundancy of language is of the same order of magnitude as the immediate 
memory span. 

The other line of evidence comes from studies of the eye-hand span. Now 
there is in any real informational processing link a certain delay. This can be 
thought of as a straight transmission and processing delay, and it is very 
analogous to the reaction time measured in the isolated disjunctive reaction 
experiment. Let us consider what happens as a person copies a telegram or a
radio message or typewrites from a copy at which he is looking. Many of the
movements made, such as pressing a key, take far less time than the 0.2 to
0.3 s required for a reaction. Obviously the second action must start before
the first is completed, and there easily develops a lag in the motor process 
so that the eye is reading or the ear hearing well ahead of what is being written. 
The same thing is true of random tracking. A man threading his way through 
a crowd is looking ahead for the next opening as he passes between the first 
people. 

This performance alone, and the delay coupled with it, make no particular 
demands on memory. Furthermore, the evidence suggests that any forced 
delay beyond the necessary processing time is likely to be injurious to the 
performance. It must also be clear that the task needs to be of such a kind 
that advance information is available. An example of this kind is tracking
a line with random movement. A moving point on a line is seen through a
window, and a pointer is adjusted so that it will be on the line. If there is
no coherence in the movement of the line, no advance information, and a
delay in the execution of the tracking, then errors will be random and equal
in range to the amplitude of the movement. But if there is a window ahead
of the point tracked equal to or slightly greater than the tracking delay,
performance is optimum.
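
The tracking argument can be illustrated with a toy simulation. Everything in it is an assumption made for the sketch: time runs in discrete steps, the target is an incoherent sequence of uniform random values, and the hypothetical function `tracking_error` models a tracker whose response lands `delay` steps after he looks and who can see up to `preview` steps ahead through the window.

```python
import random

def tracking_error(signal, delay, preview):
    """Mean absolute error of a delayed tracker with a preview window.

    The pointer position at step t is the target value the tracker saw when he
    began to respond, i.e. at step t - delay, looking as far ahead as is useful.
    """
    lookahead = min(preview, delay)          # no benefit from looking beyond the delay
    errors = []
    for t in range(delay, len(signal)):
        seen_index = t - delay + lookahead   # sample the tracker aimed at
        errors.append(abs(signal[t] - signal[seen_index]))
    return sum(errors) / len(errors)

random.seed(0)
signal = [random.uniform(-1.0, 1.0) for _ in range(10_000)]

for preview in (0, 1, 3, 5):
    print(f"delay 3, preview {preview}: mean error {tracking_error(signal, 3, preview):.3f}")
```

With no preview, or with a preview shorter than the delay, the error is as large as the difference between two unrelated target positions. Once the window reaches the delay the error in this idealized model vanishes.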
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; 


From the point of view of memory the more interesting cases are those in 
which the sequence is not random. Speech, and in particular meaningful 
speech, represents just such a sequence. For such material the eye-voice span, 
to take one instance, lengthens out a great deal. The effect is perfectly
familiar to anyone who reads aloud. The phrasing, the emphasis, the melodic
pattern is anticipated by the reader whose eye ranges well in advance of his
voice. A listener can almost gauge the span of the reader by the periodicity
of his speech, to some extent by the points at which he pauses. Furthermore,
these periods are a rather direct function of the informational (i.e. surprise)
value of the text.

I believe that fairly direct studies have been attempted on this span in
typewriting. The technical difficulties are considerable and no published
results are available. But the general results that have been reported are the
same. For more meaningful material the span increases over what it is for
material with no sequential coherence (*).

To sum up, the human organism has a short-term active memory limited 
to a matter of a few seconds duration, to perhaps ten « chunks » of information, 
and to moderate complexity, perhaps 10 bits per chunk. This active memory 
is of the greatest importance in exploiting the sequential properties of infor- 
mation that comes to us and it is of equal importance in organizing the sequence 
of actions that a person carries out. 

In this connection, I should like to point out that there is a large area,
as yet little explored, in which there is a related set of problems. These have
to do with the highly complex motor skills of a man (or a race horse!). In
speaking, in playing tennis, in dancing, in the skilled movements of a carpenter,
a mason, or a musician, in all these cases the pattern of the movement is
formed within the central nervous system under only rather general and
supervisory control of the peripheral sense organs. LASHLEY has recently
discussed this problem. All I can say here is that its analysis lies ahead of us
and contributes very little just now to our problem.


5. — Learning. 


The problems of learning and of long time memory are of such com- 
plexity that I shall not take them up here. 




(*) There exists a related problem with regard to the spatial coherence of a visual
pattern but almost nothing at all is known about this.
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6. — Man as a channel. 


This brings me to a point where I should like to venture a few rather
general statements about the human organism as a processor of information.
Throughout this discussion I have emphasized the fact that a man has a rather
low capacity for handling simultaneous information. At no point do his
achievements appear to be particularly great. Furthermore, it is my impression that
physiological studies of the central nervous system suggest the same conclusion.
Perhaps my colleagues here will dispute this. I can only report to you that
when I first started making records of potentials in the human brain more
than thirty years ago, we thought we might be missing something because
our electrodes were so gross. And yet what we found was a considerable
correlation among signals from electrodes spaced over several millimeters.

There are a number of alternatives open to us if we speculate about how
men achieve so much with a mechanism of this kind. One possibility lies in
the operation of what I have referred to as «set». This is a mechanism that
operates broadly to re-program the human computer, to rearrange the
functions played by his various parts so that the greatest resolving power is brought
to bear at the point where it is most needed. McKAY made a somewhat similar 
point several years ago when he suggested that high precision was attained 
by making successive adjustments, each time increasing the sensitivity of the 
feed-back loop as the adjustment became closer. In McKay’s scheme the 
hierarchical arrangement for changing the parameters of the system had to 
do only with precision. But the same argument applies, for example, in the 
comprehension of language, and perhaps in memory and recall. A slight change 
in the set that a man has can make all the difference between productive 
thinking and a stupid response. The suggestion is that by directing attention 
to a limited part of a problem, the informational requirements are reduced 
to manageable proportions. 

A very different instance that argues in the same direction is the relation 
of eye movements to detailed vision. As a camera the eye is impossibly bad. 
I am sure that almost none of you are aware how extremely limited is the
field in which you have sharp vision. If you fixate any point on this page
and try to read the letters in a line two or three centimeters away, you will
find that you cannot do it. And yet you stand on the shore of the lake here
and believe that you take in all the details of the glorious scene before you,
the fish jumping, the sail on the horizon, the jewel-like roofs of the houses in
the distance. Actually the «seeing» of a scene like this requires a hundred
darting glances with the eyes focussed each time on one small part of the 
total picture. Lying back of your eyes there must be a switching system that 
keeps up with this changing flow of fine details and their varying relations,
a mechanism that sorts the successive impressions into just their proper place
and combines them with all that you know about lakes and mountains and
how wonderful it is to be alive. Once more we see that the unusual feature
of the organization of the nervous system is its ability to take a low capacity
channel and to program it so that you achieve a high capacity output. There
is, of course, some trading of space for time, and a requirement for an active
memory. But even here the point to remember is that there is no fixed program,
and the amount of recoding that is done depends upon what needs to be done.

One respect in which the natural system is different from an artificial 
system is the way in which it handles the problem of error. As von NEUMANN 
remarked some years ago, the test of a large computer is its ability to handle 
its own errors. This has led in the direction of building artificial machines of 
steadily increasing reliability, and of error-detecting devices that keep error 
at a very low level of probability. The human nervous system obviously does
not work this way. Any one part of the system is relatively noisy, but the
likely output would appear to be subject to constant check and correction.
In this way, the initial information which has a low level of likelihood attached 
to it gradually undergoes refinement until the final result may be quite exact. 

Another difference may lie in the employment of more kinds of elements 
than we now recognize. We tend to think of nervous activity as made up of 
a single mode of response. In contrast with this idea, I am reminded of the 
remark in von NEUMANN’s (1957) posthumous book that it generally becomes 
inefficient to build the memory of a large computer out of active elements. 
Perhaps there are several kinds of control in the nervous system that more 
nearly resemble passive elements. There is talk about «state » variables in 
both psychology and physiology today, but it is too early to see just where
this kind of thinking leads.

Any connection of this kind of speculation with information theory is pretty 
remote, as I should like to be the first person to admit. But then the test 
of information theory, as of any other theory, is in its usefulness in suggesting 
new facts, and new ways of looking at old facts. This certainly has happened
in psychology. Many very ancient psychological problems have a new appearance
as a result of the subject this Course has been discussing. The
only trouble I have is in knowing what information theory is — is it physics,
or mathematics, or game theory, or perhaps philosophy?


On the Coding Theorem and its Converse
for Finite-Memory Channels (*).


A. FEINSTEIN 


Stanford University - Stanford, Cal. 


1. — Introduction. 


A central problem of information theory is the following: 

Let $X$ and $Y$ be sets, consisting of finite numbers of elements denoted
by $x$ and $y$, respectively. For each $x \in X$ let $p(\cdot \mid x)$ denote a probability
distribution on $Y$. For any positive integer $n$, let $X^n$ and $Y^n$ denote the product
spaces $\prod_{t=1}^{n} X_t$ and $\prod_{t=1}^{n} Y_t$, where $X_t = X$ and $Y_t = Y$. We will denote the
elements of $X^n$ by $u$, and of $Y^n$ by $v$. For each $u \in X^n$ we define a probability
distribution $p(\cdot \mid u)$ on $Y^n$ according to $p(\cdot \mid u) = p(\cdot \mid x_1) \times \dots \times p(\cdot \mid x_n)$, where
$u = (x_1, \dots, x_n)$. For a fixed $e$, $0 < e < 1$, let $N(n, e)$ be the largest integer
for which there exists a set $u_1, \dots, u_{N(n,e)}$ of elements in $X^n$ and disjoint sets
$B_1, \dots, B_{N(n,e)}$ in $Y^n$ such that $p(B_i \mid u_i) > 1 - e$ for $i = 1, \dots, N(n, e)$. What
can be said about the behaviour of $N(n, e)$? The following results have been
known for several years.


Coding theorem for discrete channel without memory. — There exists a constant
$C \ge 0$ (which is in general non-zero) such that for any $e$, $0 < e < 1$, and $H$,
$0 < H < C$, there is an $n(e, H)$ such that $N(n, e) > 2^{nH}$ for all $n > n(e, H)$.


Weak converse. — The statement of the coding theorem is not true for any
$H > C$. Specifically we have $\log N(n, e) \le (nC + 1)/(1 - e)$ for all $n$.


Quite recently WOLFOWITZ [4] has obtained a sharper estimate for $N(n, e)$
as follows:


Strong converse. — For any $e$, $0 < e < 1$, we have $\limsup_{n \to \infty} (1/n) \log N(n, e) \le C$.


The constant $C$ is called the channel capacity. All logarithms here and
henceforth are taken to the base 2.

The coding theorem and the strong converse may be summed up by the
assertion $\lim_{n \to \infty} (1/n) \log N(n, e) = C$, $0 < e < 1$. However, for the purpose of
generalizing the problem under consideration, it is best to consider these three
results separately.

The case $e = 0$ is singular, and appears to offer greater difficulties than
the case $e > 0$. That is, while it is easily shown that $\lim_{n \to \infty} (1/n) \log N(n, 0) = C_0$
exists, a simple algorithm for determining $C_0$, even in some of the simplest
non-trivial cases, is not known.


(*) This work was carried out at Stanford University, Stanford, Cal. and was
supported in part by the Office of Naval Research, United States Navy.

The triple $X$, $Y$, $p(\cdot \mid x)$ is said to define a discrete channel without
memory, and $C$ is called its capacity. The capacity is determined as follows:
given a probability distribution $p(\cdot)$ on $X$, we can define a distribution on
$X \times Y$ by $p(x, y) = p(x)\,p(y \mid x)$ and a distribution on $Y$ by $p(y) = \sum_x p(x, y)$.
We define

$$H(X) = -\sum_x p(x) \log p(x), \qquad H(Y) = -\sum_y p(y) \log p(y), \qquad H(X, Y) = -\sum_{x, y} p(x, y) \log p(x, y),$$

in which we take $0 \log 0 = 0$. Then the quantity

$$R_p = H(X) + H(Y) - H(X, Y)$$

is non-negative, and $C = \max_p R_p$. $R_p$ is called
the rate of the channel with respect to the input distribution $p(\cdot)$ on $X$; the
existence of $\max_p R_p$ follows from a simple continuity argument (*).
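The definition of the rate can be tried out numerically. The following is a minimal sketch, not part of the original paper: the function names, the binary symmetric channel `BSC` and the grid search are illustrative assumptions, and the grid search is only a crude stand-in for the maximization over input distributions.

```python
import math


def rate(p_x, channel):
    """R_p = H(X) + H(Y) - H(X, Y) in bits, with the convention 0 log 0 = 0.

    p_x[i] is p(x_i); channel[i][j] is p(y_j | x_i).
    """
    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0.0)

    # Joint distribution p(x, y) = p(x) p(y | x).
    joint = [[p_x[i] * channel[i][j] for j in range(len(channel[i]))]
             for i in range(len(p_x))]
    # Output distribution p(y) = sum over x of p(x, y).
    p_y = [sum(row[j] for row in joint) for j in range(len(joint[0]))]
    h_xy = entropy([p for row in joint for p in row])
    return entropy(p_x) + entropy(p_y) - h_xy


def capacity_binary_input(channel, steps=1000):
    """Crude grid search for C = max_p R_p when X has exactly two elements."""
    return max(rate([t / steps, 1 - t / steps], channel)
               for t in range(steps + 1))


# Hypothetical example: binary symmetric channel with crossover probability 0.1.
BSC = [[0.9, 0.1],
       [0.1, 0.9]]
```

For this symmetric example the maximum is attained at the uniform input distribution, so `capacity_binary_input(BSC)` agrees with `rate([0.5, 0.5], BSC)`.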

The situation can be generalized in various directions; that which will 
interest us here is the following: 

Let $X$, $Y$ be as before, and let $X^I = \prod_{t=-\infty}^{\infty} X_t$, $Y^I = \prod_{t=-\infty}^{\infty} Y_t$, where $X_t = X$
and $Y_t = Y$. For each element $x_\infty \in X^I$ let $\nu(\cdot \mid x_\infty)$ be a probability measure
on the Borel field $\mathcal{F}_Y$ generated by the cylinder sets in $Y^I$ which satisfies the
following conditions.

1) For any cylinder set $S \subset Y^I$, $\nu(S \mid x_\infty)$ is measurable with respect to
the Borel field $\mathcal{F}_X$ generated by the cylinder sets in $X^I$.

2) $\nu(\cdot \mid x_\infty)$ is stationary in the sense that for any cylinder $S \subset Y^I$ we
have $\nu(TS \mid Tx_\infty) = \nu(S \mid x_\infty)$, where $T$ is the shift transformation defined on $X^I$
(and $Y^I$ similarly) by $(Tx_\infty)_n = (x_\infty)_{n+1}$, where $(\ )_n$ denotes the $n$-th component
of the term within the brackets.

3) $\nu(\cdot \mid x_\infty)$ is non-anticipating; i.e., if $S \in \mathcal{F}_Y$ is such that there is a
fixed $t$ for which $(\dots, y_{-1}, y_0, y_1, \dots) \in S$ implies that $(\dots, y'_{-1}, y'_0, y'_1, \dots) \in S$ if
$y'_i = y_i$ for all $i \le t$, then $\nu(S \mid x_\infty) = \nu(S \mid x'_\infty)$ whenever $x_i = x'_i$ for all $i \le t$.


(*) For a complete treatment of the various results which are stated here and
further on, see FEINSTEIN [1].
The triple $X$, $Y$, $\nu(\cdot \mid x_\infty)$ is said to define a discrete channel with memory.
Let $\mu(\cdot)$ be a stationary probability measure on $\mathcal{F}_X$. Then

$$\omega(A \times B) = \int_A \nu(B \mid x_\infty)\, \mu(dx_\infty)$$

for any cylinders $A \subset X^I$ and $B \subset Y^I$ defines a stationary probability measure
$\omega(\cdot)$ on the space $X^I \times Y^I \sim (X \times Y)^I$, and $\eta(B) = \omega(X^I \times B)$ for any cylinder
$B \subset Y^I$ defines a stationary probability measure $\eta(\cdot)$ on $\mathcal{F}_Y$. Now each element
$(u, v) \in X^n \times Y^n$ defines a cylinder set in $(X \times Y)^I$, namely that consisting
of all pairs $(x_\infty, y_\infty)$ for which the 1-st, 2-nd, ..., $n$-th components of $x_\infty$
and $y_\infty$ respectively are $x_1, x_2, \dots, x_n$ and $y_1, y_2, \dots, y_n$, where $u = (x_1, \dots, x_n)$
and $v = (y_1, \dots, y_n)$. Thus the measure $\omega(\cdot)$ on $(X \times Y)^I$ induces a probability
distribution $\omega_n(\cdot)$ on $X^n \times Y^n$. Similarly the measures $\mu(\cdot)$ and $\eta(\cdot)$ on $X^I$ and
$Y^I$ define distributions $\mu_n(\cdot)$ and $\eta_n(\cdot)$ on $X^n$ and $Y^n$, and $\mu_n(u) = \omega_n(u \times Y^n)$,
$\eta_n(v) = \omega_n(X^n \times v)$. If we define $R_{\mu,n} = H(X^n) + H(Y^n) - H(X^n, Y^n)$, then
$\lim_{n \to \infty} (1/n) R_{\mu,n} = R_\mu$ exists. The quantity $C_s = \operatorname{lub} R_\mu$, where lub is taken over
all stationary probability measures $\mu(\cdot)$, is called the stationary capacity of
the channel $X$, $Y$, $\nu(\cdot \mid x_\infty)$. The quantity $C_e = \operatorname{lub}_E R_\mu$, where $E$ denotes
that the measure $\omega(\cdot)$ is ergodic, is called the ergodic capacity of the channel.
Clearly $C_e \le C_s$ if $C_e$ exists; however, there are channels for which $\omega(\cdot)$ is not
ergodic for any choice of $\mu(\cdot)$ (FEINSTEIN [1], p. 97).

For the general discrete channel with memory, as defined above, no reason- 
able analogue of the coding theorem or its converses is known. For channels 
with finite memory, however, an analogue of the coding theorem does hold. 

A discrete channel with memory is said to have finite memory if there 
exists a non-negative integer m such that: 


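m.1. For any cylinder $[y_i, \dots, y_r]$ in $Y^I$ we have $\nu([y_i, \dots, y_r] \mid x_\infty)
= \nu([y_i, \dots, y_r] \mid x'_\infty)$ whenever $x_t = x'_t$, $t = i - m, \dots, r$.


m.2. For any two cylinders $[y_i, \dots, y_j]$ and $[y_k, \dots, y_n]$ such that
$j + m < k$ we have $\nu([y_i, \dots, y_j] \cap [y_k, \dots, y_n] \mid x_\infty) = \nu([y_i, \dots, y_j] \mid x_\infty)\, \nu([y_k, \dots, y_n] \mid x_\infty)$
for all $x_\infty \in X^I$.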


The smallest integer $m$ for which both m.1 and m.2 hold is called the
memory of the channel.

For a channel with finite memory $m$ it is evident that
$\nu([y_{m+1}, \dots, y_n] \mid [x_1, \dots, x_n])$ is well defined for any $n > m$. Thus $\nu(\cdot \mid x_\infty)$ defines, for any
$n > m$, a probability distribution on $Y^{n-m}$ for every element $u = (x_1, \dots, x_n)$
in $X^n$, which we will denote by $\nu(\cdot \mid u)$. Specifically, we define $\nu(\cdot \mid u)$ by
$\nu(y'_1, \dots, y'_{n-m} \mid u) = \nu([y_{m+1}, \dots, y_n] \mid [x_1, \dots, x_n])$, where $y_{m+1} = y'_1, \dots, y_n = y'_{n-m}$.
Then the following is known:

Coding theorem for discrete finite-memory channels. — For any $e$, $0 < e < 1$,
and $H$, $0 < H < C_e$, there exists an $n(e, H)$ such that for any $n > n(e, H)$
there exist elements $u_1, \dots, u_N$ in $X^n$ and disjoint sets $B_1, \dots, B_N$ in $Y^{n-m}$ such
that $\nu(B_i \mid u_i) > 1 - e$ and $N > 2^{nH}$.

Since it can be shown that for a finite-memory channel, the ergodicity
of $\mu(\cdot)$ implies that of $\omega(\cdot)$, $C_e$ exists and is in general non-zero.


2. — Channel capacity and the coding theorem. 


In this Section we will give a new definition of capacity for a discrete
finite-memory channel, and derive the coding theorem and its weak converse in
terms of this capacity. We will also show that this capacity is not less than $C_s$;
in the following Section we will see that actually $C_e = C_s = C$.

We have seen that for each $n > m$, $\nu(\cdot \mid u)$ is a well defined probability
distribution on $Y^{n-m}$ for every $u \in X^n$. Thus $X^n$, $Y^{n-m}$, $\nu(\cdot \mid u)$ define a discrete
channel without memory; let $C_n$ denote its capacity.


Definition. — By the capacity of the discrete finite-memory channel $X$, $Y$,
$\nu(\cdot \mid x_\infty)$ we will mean the quantity $C = \operatorname{lub}_{n > m} C_n/n$.


Theorem 1. $C = \lim_{n \to \infty} C_n/n < \infty$.


Proof. We will show that $C_{i+j} \ge C_i + C_j$, $i, j > m$. Therefore $-C_i$ is
a subadditive function, and so $\lim_{i \to \infty} (-C_i/i) = \operatorname{glb}_{i > m} (-C_i/i) = -\operatorname{lub}_{i > m} C_i/i$. To
show that $C_{i+j} \ge C_i + C_j$, let $p_i(\cdot)$ and $p_j(\cdot)$ be probability distributions on
$X^i$ and $X^j$ respectively for which the capacities $C_i$ and $C_j$ are achieved. Then
$[p_i \times p_j](\cdot)$ defines a probability distribution on $X^{i+j}$; let $R_{i+j}$ be the rate of
$X^{i+j}$, $Y^{i+j-m}$, $\nu(\cdot \mid u)$ with respect to $[p_i \times p_j](\cdot)$. Let us apply now the data
process on $Y^{i+j-m}$ which identifies $(y_1, \dots, y_{i+j-m})$ and $(y'_1, \dots, y'_{i+j-m})$ if $y_k = y'_k$
for $k \ne i - m + 1, \dots, i$, and let $R'_{i+j}$ be the rate after data processing. Then
$R_{i+j} \ge R'_{i+j}$; but by virtue of m.2 and the product form of the input
distribution $[p_i \times p_j](\cdot)$ it follows easily that $R'_{i+j} = C_i + C_j$. Since $C_{i+j} \ge R_{i+j}$, the
proof is complete. As for $C < \infty$, we have $C_n \le \log D^n$, where $D^n$ is the number
of elements in $X^n$; hence $C \le \log D$.
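The step from superadditivity of $C_n$ to the existence of the limit can be illustrated numerically. The sketch below is an assumption-laden toy and not taken from the paper: the sequence `c(n) = n * C_LIMIT - sqrt(n)` is a hypothetical superadditive sequence standing in for $C_n$, chosen only because its limit and least upper bound are easy to see.

```python
import math

C_LIMIT = 1.5  # hypothetical limiting value, chosen only for illustration


def c(n):
    """Toy superadditive sequence: c(i + j) >= c(i) + c(j) for all i, j >= 1."""
    return n * C_LIMIT - math.sqrt(n)


def is_superadditive(f, upto):
    """Check f(i + j) >= f(i) + f(j) for 1 <= i, j < upto (small float slack)."""
    return all(f(i + j) >= f(i) + f(j) - 1e-12
               for i in range(1, upto) for j in range(1, upto))


def running_lub(f, upto):
    """Least upper bound of f(n)/n over 1 <= n <= upto."""
    return max(f(n) / n for n in range(1, upto + 1))
```

Here `c(n) / n` increases towards `C_LIMIT`, so the running least upper bound and the value at large `n` approach the same number, which is the behaviour Theorem 1 asserts for $C_n/n$.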


Theorem 2. $C_s \le C$.

Proof. Let $\mu(\cdot)$ be a stationary probability measure on $\mathcal{F}_X$, and let
$R_{n+m}$ be the rate of the memoryless channel $X^{n+m}$, $Y^n$, $\nu(\cdot \mid u)$ with respect to
the distribution defined on $X^{n+m}$ by $\mu(\cdot)$. Then $C_{n+m} \ge R_{n+m}$. On the other
hand, if we contract (see Appendix I) with respect to the first $m$ components
of $X^{n+m}$, then the rate is precisely $R_{\mu,n}$, and therefore $R_{\mu,n} \le R_{n+m} \le C_{n+m}$ for
all $n \ge 1$. Therefore $R_\mu = \lim R_{\mu,n}/n \le \lim C_{n+m}/n = \lim C_{n+m}/(n+m) = C$, and
so $C_s = \operatorname{lub} R_\mu \le C$.


Coding theorem for discrete finite-memory channels. — Let $X$, $Y$, $\nu(\cdot \mid x_\infty)$ be
a discrete finite-memory channel with capacity $C > 0$. Then for any $e$,
$0 < e < 1$, and $H$, $0 < H < C$, there is an $n(e, H)$ such that for any $n > n(e, H)$
we can find elements $u_1, \dots, u_N$ in $X^n$ and disjoint sets $B_1, \dots, B_N$ in $Y^{n-m}$ such
that $\nu(B_i \mid u_i) > 1 - e$, $i = 1, \dots, N$, and $N > 2^{nH}$.

Proof. Since $C = \lim_{n \to \infty} C_n/n$ there is a $k$ for which $C_k > kH$. Choose an
$H'$ satisfying $kH < H' < C_k$; then by the coding theorem for the memoryless
channel $X^k$, $Y^{k-m}$, $\nu(\cdot \mid u)$ there is an $n_1(e, H')$ such that for any $s > n_1(e, H')$
there is a set $w_1, \dots, w_N$ in $X^{ks}$ and disjoint sets $B_1, \dots, B_N$ in $Y^{(k-m)s}$ such that
$p(B_i \mid w_i) > 1 - e$, $i = 1, \dots, N$, and $N > 2^{sH'}$, where $p(\cdot \mid w) = \nu(\cdot \mid u_1) \times \dots \times \nu(\cdot \mid u_s)$
with $w = (u_1, \dots, u_s)$. Now to each element $(y_1, \dots, y_{(k-m)s})$ in $Y^{(k-m)s}$ we
associate a set $\varphi(y_1, \dots, y_{(k-m)s})$ in $Y^{ks-m}$, defined by

$$\varphi(y_1, \dots, y_{(k-m)s}) = (y_1, \dots, y_{k-m}) \times Y^m \times (y_{k-m+1}, \dots, y_{2(k-m)}) \times Y^m \times \dots \times (y_{(k-m)(s-1)+1}, \dots, y_{(k-m)s}).$$

If we define $B'_i = \varphi(B_i)$, $i = 1, \dots, N$, then the $B'_i$ are disjoint sets in $Y^{ks-m}$, and by m.2
it follows (*) that $p(B_i \mid w_i) = \nu(B'_i \mid w_i) > 1 - e$, $i = 1, \dots, N$, and $N > 2^{sH'} > 2^{ksH}$,
which proves the coding theorem for all $n$ of the form $ks$ for $s > n_1(e, H')$.
Let us assume $n_1(e, H')$ taken so large that

$$\frac{n_1(e, H')}{n_1(e, H') + 1} > \frac{kH}{H'}.$$

For any $n > k\,n_1(e, H')$ set $n = s_1 k + r$, where $0 \le r < k$. For $n' = s_1 k$ the
theorem is proved; let $u_1, \dots, u_N$ and $B_1, \dots, B_N$, $N > 2^{s_1 H'}$, be the corresponding
elements in $X^{n'}$ and sets in $Y^{n'-m}$ respectively. Let $z_0$ be any fixed element
in $X^r$; then $u'_i = (z_0, u_i)$ and $B'_i = Y^r \times B_i$, $i = 1, \dots, N$, defines elements $u'_i$ in $X^n$
and disjoint sets $B'_i$ in $Y^{n-m}$. By m.1 it follows that $\nu(B'_i \mid u'_i) = \nu(B_i \mid u_i)$ for all $i$.
Since $N > 2^{s_1 H'}$ and $s_1 H'/n > \{n_1(e, H')/(n_1(e, H') + 1)\}(H'/k) > H$, we have $N > 2^{nH}$,
which completes the proof.


Weak converse. — Let $N(n, e)$ be the largest integer for which we can find
elements $u_1, \dots, u_{N(n,e)}$ in $X^n$ and disjoint sets $B_1, \dots, B_{N(n,e)}$ in $Y^{n-m}$ such that
$\nu(B_i \mid u_i) > 1 - e$, $i = 1, \dots, N(n, e)$, where $e$ satisfies $0 < e < 1$. Then
$\log N(n, e) \le (nC + 1)/(1 - e)$.


Proof. It is shown in FEINSTEIN [1] that the existence of elements
$u_1, \dots, u_{N(n,e)}$ and sets $B_1, \dots, B_{N(n,e)}$ having the stated properties implies that
$C_n \ge \log N(n, e) - e \log N(n, e) - 1$, that is, $\log N(n, e) \le (C_n + 1)/(1 - e)$.
But $C \ge C_n/n$, from which the desired result follows.


(*) See also FEINSTEIN [1], p. 104. 




3. — Equality of $C_e$, $C_s$ and $C$.


We have seen above that $C_s \le C$. It follows from the definition of $C_e$
that $C_e \le C_s$. In order to show that $C_e = C_s = C$, we will show that $C_e \ge C$.
This will be accomplished by constructing, for each $j > m$, an ergodic
probability measure $\mu_j(\cdot)$ on $X^I$ such that $R_{\mu_j} \ge C_j/j$. Now it is easily shown
(cf. FEINSTEIN [1], p. 99-103) that for a finite-memory channel the ergodicity
of $\mu(\cdot)$ implies that of $\omega(\cdot)$. Hence $C_e \ge \operatorname{lub} R_{\mu_j} \ge \operatorname{lub} C_j/j = C$.

The construction of $\mu_j(\cdot)$ is based on the following considerations (*). Let
$p(\cdot)$ be, for some fixed $s$, a probability distribution on $X^s$. We define a
probability measure $q(\cdot)$ on $\mathcal{F}_X$ as follows: for any integer $m \ge 0$ we put

$$q([x_{-ms+1}, \dots, x_{ms}]) = p(x_{-ms+1}, \dots, x_{-(m-1)s}) \cdots p(x_{(m-1)s+1}, \dots, x_{ms}).$$

Since any cylinder set in $X^I$ is the union of finitely many disjoint cylinders of the type for
which we have defined $q(\cdot)$, it follows readily that $q(\cdot)$ is well defined on $\mathcal{F}_X$.
For arbitrary $p(\cdot)$, $q(\cdot)$ will not in general be stationary with respect to the
shift transformation $T$ on $X^I$. However $q(\cdot)$ is evidently stationary with
respect to $T^s$. We define $\bar{q}(A) = (1/s)\,(q(A) + q(TA) + \dots + q(T^{s-1}A))$ for all
$A \in \mathcal{F}_X$; then $\bar{q}(TA) = (1/s)\,(q(TA) + q(T^2A) + \dots + q(T^sA)) = \bar{q}(A)$, since
$q(T^sA) = q(A)$. Thus $\bar{q}(\cdot)$ is a stationary probability measure on $\mathcal{F}_X$.
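The construction of the block-product measure and its average over shifts can be checked on a small example. This is a minimal sketch under stated assumptions, not the paper's own procedure: the alphabet is taken to be $\{0, 1\}$, the block length is $s = 2$, the distribution `P_BLOCK` is hypothetical, and blocks are assumed aligned on coordinates $ks+1, \dots, (k+1)s$ as in the definition above.

```python
from itertools import product

# Hypothetical block distribution p() on X^2 with X = {0, 1}.
P_BLOCK = {(0, 0): 0.5, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.1}


def q_prob(word, start, p, s, alphabet=(0, 1)):
    """Probability under q() that coordinates start, start+1, ... spell `word`.

    q() makes the blocks (x_{ks+1}, ..., x_{(k+1)s}) independent, each with law p().
    """
    end = start + len(word) - 1
    first_block = (start - 1) // s       # index of the block containing `start`
    last_block = (end - 1) // s          # index of the block containing `end`
    lo = first_block * s + 1             # first coordinate of the covering window
    n_blocks = last_block - first_block + 1
    total = 0.0
    # Sum p() over all fillings of the covering blocks that agree with `word`.
    for blocks in product(product(alphabet, repeat=s), repeat=n_blocks):
        flat = [x for block in blocks for x in block]
        if tuple(flat[start - lo:start - lo + len(word)]) == tuple(word):
            prob = 1.0
            for block in blocks:
                prob *= p[block]
            total += prob
    return total


def q_bar(word, start, p, s):
    """Average of q() over s consecutive shifts of the cylinder."""
    return sum(q_prob(word, start - i, p, s) for i in range(s)) / s
```

With these values `q_prob((0, 1), 1, P_BLOCK, 2)` and `q_prob((0, 1), 2, P_BLOCK, 2)` differ, so `q` depends on the alignment of the window, whereas `q_bar` gives the same value for every starting coordinate.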


Lemma 1. The measure $\bar{q}(\cdot)$ is ergodic with respect to $T$.

Proof. It is well known (see e.g. FEINSTEIN [1], p. 99-102) that for the
ergodicity of a probability measure $\mu(\cdot)$ on $\mathcal{F}_X$ it is sufficient (and indeed
necessary also) that

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \mu(T^{-i}A \cap B) = \mu(A)\,\mu(B)$$

for all cylinders $A, B \subset X^I$.
Due to the additivity of both sides with respect to $A$ and $B$, it evidently
suffices to verify this condition for $A$ and $B$ of the form $[x_{-ms+1}, \dots, x_{ms}]$. Now

$$\bar{q}(T^{-i}A \cap B) = \frac{1}{s}\left[ q(T^{-i}A \cap B) + q(T^{-i+1}A \cap TB) + \dots + q(T^{-i+s-1}A \cap T^{s-1}B) \right].$$

From the definition of $q(\cdot)$ it easily follows that when $i > (2m+1)s - 1$ we
have $q(T^{-i}A \cap B) = q(T^{-i}A)\,q(B)$, $q(T^{-i+1}A \cap TB) = q(T^{-i+1}A)\,q(TB)$, ...,
$q(T^{-i+s-1}A \cap T^{s-1}B) = q(T^{-i+s-1}A)\,q(T^{s-1}B)$. Since by virtue of the stationarity
of $q(\cdot)$ with respect to $T^s$, $q(T^{-i}A)$, $q(T^{-i+1}A)$, ..., $q(T^{-i+s-1}A)$ all run cyclically
through the values $q(A)$, $q(TA)$, ..., $q(T^{s-1}A)$ as $i$ runs over all positive integers,
while the value of $\lim (1/n) \sum_{i=0}^{n-1} q(T^{-i}A \cap B)$ is independent of the value of
$q(T^{-i}A \cap B)$ for any finite number of values of $i$, it follows that

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} q(T^{-i}A \cap B) = \frac{1}{s}\left[ q(A) + q(TA) + \dots + q(T^{s-1}A) \right] q(B),$$

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} q(T^{-i+1}A \cap TB) = \frac{1}{s}\left[ q(A) + \dots + q(T^{s-1}A) \right] q(TB),$$

etc. Hence

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \bar{q}(T^{-i}A \cap B) = \frac{1}{s}\left[ q(A) + \dots + q(T^{s-1}A) \right] \cdot \frac{1}{s}\left[ q(B) + \dots + q(T^{s-1}B) \right] = \bar{q}(A)\,\bar{q}(B),$$

and so $\bar{q}(\cdot)$ is ergodic.


(*) The construction of $\mu_j(\cdot)$ was suggested by a result (Theorem 7) of J. NEDOMA [2].


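Lemma 2. Let $\beta(\cdot)$ be a probability measure on $\mathcal{F}_X$ which is stationary
with respect to $T^s$. Then the quantity

$$H_\beta(X) = \lim_{n \to \infty} -\frac{1}{n} \sum_{x_{1+i}, \dots, x_{n+i}} \beta([x_{1+i}, \dots, x_{n+i}]) \log \beta([x_{1+i}, \dots, x_{n+i}])$$

exists and is independent of $i$.


Proof. The demonstration is a simple adaptation of the usual one for
stationary $\beta(\cdot)$. For each fixed $r$, the sequence $H(X_r \mid X_{r-1})$, $H(X_r \mid X_{r-1}, X_{r-2})$, ...
is known to be non-increasing, and therefore $H_r = \lim_{j \to \infty} H(X_r \mid X_{r-1}, \dots, X_{r-j})$
exists. Clearly the sequence $H_1, H_2, \dots$ is periodic with period $s$. Now

$$-\log \beta([x_{1+i}, \dots, x_{n+i}]) = -\log \beta([x_{1+i}]) - \log \frac{\beta([x_{1+i}, x_{2+i}])}{\beta([x_{1+i}])} - \dots - \log \frac{\beta([x_{1+i}, \dots, x_{n+i}])}{\beta([x_{1+i}, \dots, x_{n-1+i}])}.$$

Multiplying the right side by $\beta([x_{1+i}, \dots, x_{n+i}])$ and summing over $x_{1+i}, \dots, x_{n+i}$,
we obtain $H(X_{1+i}) + H(X_{2+i} \mid X_{1+i}) + \dots + H(X_{n+i} \mid X_{n-1+i}, \dots, X_{1+i})$. Let us
single out the terms $H(X_{1+i})$, $H(X_{1+i+s} \mid X_{i+s}, \dots, X_{1+i})$, $H(X_{1+i+2s} \mid X_{i+2s}, \dots, X_{1+i})$, ....
Evidently $H(X_{1+i+s} \mid X_{i+s}, \dots, X_{1+i}) = H(X_{1+i} \mid X_i, \dots, X_{1+i-s})$,
$H(X_{1+i+2s} \mid X_{i+2s}, \dots, X_{1+i}) = H(X_{1+i} \mid X_i, \dots, X_{1+i-2s})$, etc. Therefore
$\lim_{k \to \infty} H(X_{1+i+ks} \mid X_{i+ks}, \dots, X_{1+i}) = H_{1+i}$. Similarly
$\lim_{k \to \infty} H(X_{2+i+ks} \mid X_{1+i+ks}, \dots, X_{1+i}) = H_{2+i}$, and so on. Using
the simple fact that the Cesàro averages of a convergent sequence converge
to the limit of the sequence, it follows easily that

$$\lim_{n \to \infty} \frac{1}{n} \left[ H(X_{1+i}) + H(X_{2+i} \mid X_{1+i}) + \dots + H(X_{n+i} \mid X_{n-1+i}, \dots, X_{1+i}) \right] = \frac{1}{s} \left( H_{1+i} + H_{2+i} + \dots + H_{s+i} \right).$$

Since the sequence $H_1, H_2, \dots$ has period $s$, the average on the right is
independent of $i$, which completes the proof.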


The following result was pointed out by L. BREIMAN.


Lemma 3. Let $q_1(\cdot), \dots, q_s(\cdot)$ be a set of probability measures on $\mathcal{F}_X$
such that

$$H_i(X) = \lim_{n \to \infty} -\frac{1}{n} \sum_{x_1, \dots, x_n} q_i([x_1, \dots, x_n]) \log q_i([x_1, \dots, x_n])$$

exists, $i = 1, \dots, s$. Given a set $a_1, \dots, a_s$, $a_i \ge 0$, $\sum_{i=1}^{s} a_i = 1$, let $q(\cdot) = \sum_{i=1}^{s} a_i q_i(\cdot)$.
Then

$$\lim_{n \to \infty} -\frac{1}{n} \sum_{x_1, \dots, x_n} q([x_1, \dots, x_n]) \log q([x_1, \dots, x_n]) = \sum_{i=1}^{s} a_i H_i(X).$$

Proof. Writing $[x]$ for $[x_1, \dots, x_n]$, we have

$$-\sum_{x} q([x]) \log q([x]) = -\sum_{i=1}^{s} {\sum_{x}}' a_i q_i([x]) \log \left\{ a_i q_i([x]) \left[ 1 + \frac{\sum_{j \ne i} a_j q_j([x])}{a_i q_i([x])} \right] \right\}$$

$$= -\sum_{i=1}^{s} {\sum_{x}}' a_i q_i([x]) \log a_i q_i([x]) - \sum_{i=1}^{s} {\sum_{x}}' a_i q_i([x]) \log \left[ 1 + \frac{\sum_{j \ne i} a_j q_j([x])}{a_i q_i([x])} \right].$$

Now

$$\lim_{n \to \infty} -\frac{1}{n} {\sum_{x}}' a_i q_i([x]) \log a_i q_i([x]) = -a_i \lim_{n \to \infty} \frac{1}{n} \sum_{x} q_i([x]) \log q_i([x]) = a_i H_i(X), \qquad i = 1, \dots, s.$$

Furthermore, the inequality $\log(1 + x) \le x \log e$ implies that for each
$x_1, \dots, x_n$,

$$0 \le a_i q_i([x]) \log \left\{ 1 + \frac{\sum_{j \ne i} a_j q_j([x])}{a_i q_i([x])} \right\} \le \sum_{j \ne i} a_j q_j([x]) \log e.$$

Thus

$$0 \le \lim_{n \to \infty} \frac{1}{n} {\sum_{x}}' a_i q_i([x]) \log \left\{ 1 + \frac{\sum_{j \ne i} a_j q_j([x])}{a_i q_i([x])} \right\} \le \lim_{n \to \infty} \frac{1}{n} \sum_{x} \sum_{j \ne i} a_j q_j([x]) \log e \le \lim_{n \to \infty} \frac{1}{n} \sum_{j \ne i} a_j \log e = 0,$$

for each $i$, where ${\sum}'$ indicates summation over those $x_1, \dots, x_n$ for which
$q_i([x_1, \dots, x_n]) > 0$. Combining this with the preceding results establishes the
lemma.
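The statement of Lemma 3 can be checked numerically in the simplest case. The sketch below rests on assumptions that go beyond the text: the component measures are taken to be i.i.d. Bernoulli measures on binary sequences (so that $H_i(X)$ is the binary entropy of the parameter), and the function names, parameters and weights are hypothetical.

```python
import math


def h2(t):
    """Binary entropy in bits, for 0 < t < 1."""
    return -t * math.log2(t) - (1 - t) * math.log2(1 - t)


def block_entropy_per_symbol(thetas, weights, n):
    """-(1/n) * sum over binary words of length n of q(word) * log2 q(word).

    q is the mixture sum_i weights[i] * Bernoulli(thetas[i])^n (an assumed example).
    """
    total = 0.0
    for k in range(n + 1):
        # All words with k ones share the same mixture probability.
        q_word = sum(a * t**k * (1 - t)**(n - k)
                     for a, t in zip(weights, thetas))
        if q_word > 0.0:
            total -= math.comb(n, k) * q_word * math.log2(q_word)
    return total / n
```

For i.i.d. components the per-symbol block entropy of the mixture lies between $\sum_i a_i H_i(X)$ and that value plus $H(a)/n$, where $H(a)$ is the entropy of the weights, so the gap closes at rate $1/n$, in line with the lemma.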
Theorem 3. $C_e = C_s = C$.


Proof. In view of Theorem 2 and the evident relation $C_e \le C_s$, it
suffices to show that $C_e \ge C$. Since $C = \lim_{j \to \infty} C_j/j$ we can find, for any $e > 0$, an
$r$ such that $C_r/r > C - e$. Let $p(\cdot)$ be the probability distribution on $X^r$ for
which $C_r$ is attained, and let $q(\cdot)$ and $\bar{q}(\cdot)$ be the probability measures on $\mathcal{F}_X$
derived from $p(\cdot)$ as defined preceding Lemma 1. Since $q(\cdot)$ is stationary with
respect to $T^r$, it follows that the probability measures $\omega(\cdot)$ and $\eta(\cdot)$ defined by
$\omega(A \times B) = \int_A \nu(B \mid x_\infty)\, q(dx_\infty)$ and $\eta(B) = \omega(X^I \times B)$ for any cylinders $A \subset X^I$,
$B \subset Y^I$ are likewise stationary with respect to $T^r$ on $(X \times Y)^I$ and $Y^I$
respectively. Let $q_i(S) = q(T^iS)$ for $S \in \mathcal{F}_X$, $i = 0, \dots, r - 1$. Applying Lemma 2 to
$q_i(\cdot)$, $\omega_i(\cdot)$, and $\eta_i(\cdot)$, it follows that the quantities $R_{q_i} = \lim_{n \to \infty} R_{q_i,n}/n$ exist and
are independent of $i$; let their common value be $R_q$. By Lemma 3 as applied
to $\bar{q}(\cdot)$, $\bar{\omega}(\cdot)$, and $\bar{\eta}(\cdot)$, it follows that $R_{\bar{q}} = R_q$, and from Lemma 1 and the
definition of $C_e$ follows then $C_e \ge R_{\bar{q}} = R_q$. We will now show that $R_q \ge C_r/r$.
Now $R_q = \lim_{d \to \infty} R_{q,dr}/dr$; but $R_{q,dr}$ is the rate of a channel (in the sense of
Appendix I) whose input space is

$$\underbrace{X^r \times X^r \times \dots \times X^r}_{d \text{ factors}}$$

and whose output space is

$$\underbrace{Y^r \times Y^r \times \dots \times Y^r}_{d \text{ factors}}.$$

Now apply the data process on $Y^{dr}$ which, in each factor $Y^r$, identifies
two elements $y_1, \dots, y_r$ and $y'_1, \dots, y'_r$ if $y_{m+1} = y'_{m+1}, \dots, y_r = y'_r$. Let $R'_{q,dr}$ be the
rate after data processing; then $R'_{q,dr} \le R_{q,dr}$. But it follows from m.2 [1],
and the product nature of $q(\cdot)$ that $R'_{q,dr} = dC_r$. Hence $R_{q,dr}/dr \ge C_r/r > C - e$ for
all $d$, and therefore $C_e \ge R_{\bar{q}} = R_q = \lim R_{q,dr}/dr \ge C - e$; but $e$ was arbitrary,
hence $C_e \ge C$.


4. — Extension of a result of Wolfowitz. 


In this Section we will consider a particular type of discrete finite-memory
channel, for which we can establish the strong converse of the coding theorem.
This channel, which has been studied recently by WOLFOWITZ [4], is defined
as follows. Let $m \ge 0$ be a fixed integer, and for every $(x_1, \dots, x_{m+1}) \in X^{m+1}$
let $p(\cdot \mid x_1, \dots, x_{m+1})$ be a probability distribution on $Y$. Let $x_\infty = (\dots, x_{-1}, x_0, x_1, \dots)$;
then we set

$$\nu(\cdot \mid x_\infty) = \prod_{t=-\infty}^{\infty} p(\cdot \mid x_{t-m}, \dots, x_t),$$

where the «alignment» of the product measure is fixed by $\nu([y_0] \mid x_\infty) =
p(y_0 \mid x_{-m}, \dots, x_0)$. It is readily verified that $\nu(\cdot \mid x_\infty)$ defines a channel with
memory $m$, for which m.2 is satisfied for $m = 0$.

We will prove the strong converse of the coding theorem for this channel, 
in a form which is clearly analogous to the memoryless case. 


Strong converse. Let the channel $X$, $Y$, $\nu(\cdot \mid x_\infty)$ be as above, and
let $C \ge 0$ be its capacity. Let $e$, $H$ be fixed, such that $0 < e < 1$, $H > C$.
Then it is not possible to find arbitrarily large $n$ such that there exist
elements $u_1, \dots, u_N \in X^n$ and disjoint sets $B_1, \dots, B_N$ in $Y^{n-m}$ such that
$\nu(B_i \mid u_i) > 1 - e$, $i = 1, \dots, N$, and $N > 2^{nH}$.


N 


Proof. We proceed by contradiction. Given $H > C$ we choose integers
$r > m$ and $d$ so large that $((d - 1)/d)((r - m)/r)H > C$; $r$ and $d$ are henceforth
fixed. For $n > (d - 1)(r - m) + m$ define $k$ by $(k - 1)(r - m) + m <
n \le k(r - m) + m$. Suppose that $n > (d - 1)(r - m) + m$ is such that there
exist elements $u_1, \dots, u_N \in X^n$ and disjoint sets $B_1, \dots, B_N$ in $Y^{n-m}$ for which
$\nu(B_i \mid u_i) > 1 - e$, $i = 1, \dots, N$, and $N > 2^{nH}$. Let $n' = k(r - m) + m \ge n$; then,
as we have seen in the proof of the coding theorem, there exist sequences
$u'_1, \dots, u'_N$ in $X^{n'}$ and disjoint sets $B'_1, \dots, B'_N$ in $Y^{n'-m}$ such that
$\nu(B'_i \mid u'_i) > 1 - e$, $i = 1, \dots, N$. We now define a mapping $\varphi$ which takes elements of
$X^{n'}$ into elements of $X^{kr}$ as follows:

$$\varphi(x_1, \dots, x_{n'}) = (x_1, x_2, \dots, x_r,\; x_{r-m+1}, x_{r-m+2}, \dots, x_{2r-m},\; x_{2r-2m+1}, \dots, x_{n'-r+m},\; x_{n'-r+1}, \dots, x_{n'}).$$

In words, we begin by setting down the first $r$ elements of $(x_1, \dots, x_{n'})$;
then we repeat the last $m$, then set down the next $r - m$, then repeat the last
$m$, then set down the next $r - m$, and so on, the process ending as soon as $x_{n'}$
has been reached. It is clear that distinct elements in $X^{n'}$ go into distinct
elements in $X^{kr}$ under the mapping $\varphi$. Now for each $w \in X^{kr}$, let $p(\cdot \mid w)$ be
the probability distribution on $Y^{k(r-m)}$ defined by $p(\cdot \mid w) = \nu(\cdot \mid x_1, \dots, x_r) \times \dots \times
\nu(\cdot \mid x_{(k-1)r+1}, \dots, x_{kr})$, where $\nu(\cdot \mid x_1, \dots, x_r)$ is, as usual, a probability distribution
on $Y^{r-m}$ for each $(x_1, \dots, x_r) \in X^r$. Now from the particular form of $\nu(\cdot \mid x_\infty)$
it follows that $\nu(\cdot \mid x_1, \dots, x_{n'})$ and $p(\cdot \mid \varphi(x_1, \dots, x_{n'}))$ are identical probability
distributions on $Y^{n'-m} = Y^{k(r-m)}$. Let $w_i = \varphi(u'_i)$, $i = 1, \dots, N$; then we have
$p(B'_i \mid w_i) > 1 - e$, $i = 1, \dots, N$, $N > 2^{nH}$. But

$$H' = \frac{n}{kr} H > \frac{(k-1)(r-m) + m}{kr} H \ge \frac{d-1}{d} \cdot \frac{r-m}{r} H > C,$$

so $N > 2^{nH} = 2^{krH'}$, where $rH' > C_r$. Now as $n$ becomes arbitrarily large, $k$ does
also, and we therefore have a contradiction of the strong converse of the
coding theorem for the memoryless channel defined by $X^r$, $Y^{r-m}$, $\nu(\cdot \mid x_1, \dots, x_r)$,
which completes the proof.
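The mapping described in words in the proof (set down $r$ symbols, repeat the last $m$, set down the next $r - m$, and so on) can be sketched as follows. This is an illustrative sketch only: the function name `phi` and the use of Python lists for input words are assumptions, and the input length is required to be of the form $k(r - m) + m$, as in the proof.

```python
def phi(x, r, m):
    """Overlapping-block expansion of a word x of length k*(r - m) + m.

    Returns the concatenation of the k windows of length r that start
    every r - m symbols, so consecutive windows share m symbols.
    """
    step = r - m
    if step <= 0:
        raise ValueError("need r > m")
    if len(x) < r or (len(x) - m) % step != 0:
        raise ValueError("len(x) must equal k*(r - m) + m for some k >= 1")
    k = (len(x) - m) // step
    out = []
    for b in range(k):
        # Window b covers x_{b(r-m)+1}, ..., x_{b(r-m)+r} in 1-based indexing.
        out.extend(x[b * step:b * step + r])
    return out
```

For example, with `r = 3` and `m = 1` the word `[1, 2, 3, 4, 5, 6, 7]` has length $3(r - m) + m$ and is expanded into three windows of length 3; the output has length $kr$, and the original word can be read back from it, which is why distinct inputs give distinct outputs.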

We may remark, in passing, that for the channels considered in this Section
it is simple to obtain a necessary and sufficient condition for the vanishing
of the capacity $C$. Indeed, since $C = \operatorname{lub}_{j > m} C_j/j$, it follows that $C = 0$ implies
$C_{m+1} = 0$, which in turn implies (cf. FEINSTEIN [1], p. 32) that the set
$\{p(\cdot \mid x_1, \dots, x_{m+1})\}$, $(x_1, \dots, x_{m+1}) \in X^{m+1}$, of probability distributions on $Y$
consists of only one distinct member. Conversely, this last condition evidently
implies that the set $\{\nu(\cdot \mid u)\}$, $u \in X^n$, of probability distributions on $Y^{n-m}$ also
consists of only one distinct member, which implies $C_n = 0$ for all $n > m$,
and so $C = 0$. Hence for the vanishing of $C$ it is both necessary and sufficient
that the set $\{p(\cdot \mid x_1, \dots, x_{m+1})\}$, $(x_1, \dots, x_{m+1}) \in X^{m+1}$, consist of only one distinct
member. As a consequence of this result, the construction of examples of
(both discrete and semi-continuous) finite-memory channels with non-zero
capacity becomes trivial (*).


5. — Semi-continuous finite-memory channels. 


In this Section we will discuss briefly the extent to which our results carry 
over from discrete finite-memory channels to semi-continuous finite-memory 
channels. 

In essence, a semi-continuous channel is obtained from a discrete channel 
by replacing the finite space Y of the latter by an arbitrary space Z in which 
is defined a Borel field. Specifically, a semi-continuous channel without me- 
mory is defined by the usual space X, an arbitrary space Z in which is defined 
a Borel field 7, and, for each x € X, a probability measure p( |x) on F. The 
rate of this channel with respect to a probability distribution p() on X may 
be defined by noticing that although $H(X, Y)$ and $H(Y)$ have no direct
analogues in the semi-continuous case, their difference

$$H(X \mid Y) = H(X, Y) - H(Y) = -\sum_x \sum_y p(x, y) \log p(x \mid y)$$

is well defined even in this more general situation. Indeed, $p(x \mid z)$ is well
defined as a Radon-Nikodym derivative, and

$$H(X \mid Z) = -\sum_x \int_Z \log p(x \mid z)\, p(x, dz)$$

is well defined; we put $R_p = H(X) - H(X \mid Z)$. The capacity is defined as
before by $C = \max_p R_p$, where again the existence of $\max_p R_p$ follows from a
continuity argument.


(*) This result replaces the discussion on p. 98 of FEINSTEIN [1], in which the
measure $\eta(\cdot)$ is incorrectly (in general) asserted to be defined by a Markoff chain.

For a semi-continuous channel without memory, both the coding theorem 
and its weak converse are known to hold; the strong converse is at present 
undecided. 

The definition of an arbitrary semi-continuous channel with memory
proceeds in similar fashion; we replace $Y$ by $Z$. However, a technical point
arises in the definition of $\nu(\cdot \mid x_\infty)$. Let $Z^I = \prod_{t=-\infty}^{\infty} Z_t$, $Z_t = Z$, and let $\mathcal{F}^I$ be
the Borel field in $Z^I$ which is determined by $\mathcal{F}$ in the usual manner. Let $\mathcal{F}^n$
be the Borel field of sets $S \in \mathcal{F}^I$ such that $(\dots, z_{-1}, z_0, z_1, \dots) \in S$ and $z_i = z'_i$,
$i = -n, \dots, n$, implies that $(\dots, z'_{-1}, z'_0, z'_1, \dots) \in S$. Let $\mu(\cdot)$ be a set function
defined on $\bigcup_{n=1}^{\infty} \mathcal{F}^n$ which is a probability measure on each $\mathcal{F}^n$. Then it is not
unrestrictedly true that $\mu(\cdot)$ can be extended to $\mathcal{F}^I$ as a probability measure.
Now in our results for the discrete case the only property of $\nu(\cdot \mid x_\infty)$ that
was actually used was that it was defined on $\bigcup_{n=1}^{\infty} \mathcal{F}^n$, and that it was a
probability distribution on $\mathcal{F}^n$ for every $n$ (and similarly for $\omega(\cdot)$ and $\eta(\cdot)$) (*).
The same is true here; it is sufficient to require only that $\nu(\cdot \mid x_\infty)$ is defined
on $\bigcup_{n=1}^{\infty} \mathcal{F}^n$ and is a probability measure on $\mathcal{F}^n$ for each $n$. It follows that for
a given probability measure $q(\cdot)$ on $\mathcal{F}_X$, $\omega(\cdot)$ and $\eta(\cdot)$ are not necessarily
measures on $\mathcal{F}_X \times \mathcal{F}^I$ or $\mathcal{F}^I$ respectively, but only on $\mathcal{F}_X \times \mathcal{F}^n$ and $\mathcal{F}^n$ for
every $n$.

With these definitions it is easily seen that $C = \operatorname{lub}_{j > m} C_j/j = \lim_{j \to \infty} C_j/j$, and
that the proofs of the coding theorem and its weak converse remain unaltered.
The result $C_e = C_s = C$ requires certain modifications however. Here, the
difficulty arises in the definition $R_\mu = \lim_{n \to \infty} R_{\mu,n}/n$ for the rate with respect to a
given stationary probability measure $\mu(\cdot)$ on $X^I$; it is not known whether or
not the limit $R_\mu$ actually exists. We can, however, avoid the question by
defining $R_\mu = \limsup_{n \to \infty} R_{\mu,n}/n$; then $C_s$ can be defined as in the discrete case,
and the proof of $C_s \le C$ remains valid. In order to define $C_e$ in a reasonable
way, we have to impose some condition similar to ergodicity on $\omega(\cdot)$. A
reasonable analogue to the discrete case would be that

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \omega(T^{-i}A \cap B) = \omega(A)\,\omega(B),$$

for any sets $A$, $B$ in $\bigcup_{n=1}^{\infty} \mathcal{F}_X \times \mathcal{F}^n$. Then it can be shown, just as in the discrete
case, that for a finite-memory channel this condition is equivalent to the
ergodicity of $\mu(\cdot)$. Clearly $C_e \le C_s$ just as in the discrete case, and we show that
$C_e = C_s = C$ by proving that $C_e \ge C$. Before doing so, we should point out
that this proof is, at the moment, of no greater interest than an analytical
exercise, for we do not know whether $C_e$, as defined here, enters into a
statement of the coding theorem as it does in the discrete case.


(*) We may remark that in the discrete case $\nu(\cdot \mid x_\infty)$ can be uniquely extended to $\mathcal{F}_Y$
according to the Kolmogorov extension theorem; however, this fact was never required.

The proof follows that in the discrete case with only minor variations. 
The first concerns the analogue of Lemma 3, or rather an analogue of the 
linearity of R as a function of u(), which followed directly from Lemma 3 
in the discrete case. Actually, the linearity of R is not essential; a slightly 
weaker result is sufficient. Indeed, from the proof of Lemma 3 the following 
is easily shown: 

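For each integer $r \ge 1$ there is a constant $F_r \ge 0$ with the following
property: Let $X$, $Y$, $p(\cdot \mid x)$ be any discrete memoryless channel, let $p_1(\cdot), \dots,
p_r(\cdot)$ be probability distributions on $X$, and let $R_1, \dots, R_r$ be the corresponding
rates of the channel. Set $\bar{p}(\cdot) = (1/r)[p_1(\cdot) + \dots + p_r(\cdot)]$; then

$$\left| R_{\bar{p}} - \frac{1}{r} \sum_{i=1}^{r} R_i \right| \le F_r.$$

An explicit value for $F_r$ is readily obtained, but will not be needed.
The same result holds for semi-continuous channels without memory. Let
$r = 2$ for simplicity; we set $p_i(x, \cdot) = p_i(x)\,p(\cdot \mid x)$, $\eta_i(\cdot) = \sum_x p_i(x, \cdot)$, $i = 1, 2$;
$\bar{p}(x) = \tfrac{1}{2}[p_1(x) + p_2(x)]$, $\bar{p}(x, \cdot) = \bar{p}(x)\,p(\cdot \mid x)$, and $\bar{\eta}(\cdot) = \tfrac{1}{2}[\eta_1(\cdot) + \eta_2(\cdot)]$. Then

$$\bar{H}(X \mid Z) = -\sum_x \int_Z \log \bar{p}(x \mid z)\, \bar{p}(x, dz) = -\frac{1}{2} \sum_x \int_Z \log \bar{p}(x \mid z)\, p_1(x, dz) - \frac{1}{2} \sum_x \int_Z \log \bar{p}(x \mid z)\, p_2(x, dz).$$

Now

$$\bar{p}(x \mid z) = \frac{\bar{p}(x, dz)}{\bar{\eta}(dz)} = \frac{p_1(x, dz) + p_2(x, dz)}{\eta_1(dz) + \eta_2(dz)} = \left[ \frac{p_1(x, dz)}{\eta_1(dz)} + \frac{p_2(x, dz)}{\eta_1(dz)} \right] \Big/ \left[ 1 + \frac{\eta_2(dz)}{\eta_1(dz)} \right]$$

a.e. $\eta_1(\cdot)$, where the derivatives with respect to $\eta_1(\cdot)$ are, by the Lebesgue
decomposition, well defined a.e. $\eta_1(\cdot)$. Further,

$$\frac{p_2(x, dz)}{\eta_1(dz)} = \frac{p_1(x, dz)}{\eta_1(dz)} \cdot \frac{p_2(x, dz)}{p_1(x, dz)}$$

a.e. $p_1(x, \cdot)$. Since a.e. $\eta_1(\cdot)$ implies a.e. $p_1(x, \cdot)$, we have

$$\bar{p}(x \mid z) = p_1(x \mid z) \left[ 1 + \frac{p_2(x, dz)}{p_1(x, dz)} \right] \Big/ \left[ 1 + \frac{\eta_2(dz)}{\eta_1(dz)} \right]$$

a.e. $p_1(x, \cdot)$. We use this expression in

$$-\frac{1}{2} \sum_x \int_Z \log \bar{p}(x \mid z)\, p_1(x, dz),$$

and the same expression, but with the indices 1, 2 interchanged, in

$$-\frac{1}{2} \sum_x \int_Z \log \bar{p}(x \mid z)\, p_2(x, dz).$$

From this point the proof follows that of the discrete case.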
The demonstration of $C_e \ge C$ continues now as in the discrete case.
Given $e > 0$ we find $r$ such that $C_r/r > C - e$; we define $\bar{q}(\cdot) = (1/r) \sum_{i=0}^{r-1} q_i(\cdot)$,
where $q_i(A) = q(T^iA)$, $i = 0, \dots, r - 1$, $A \in \mathcal{F}_X$. Now we have that

$$\left| R_{\bar{q},n} - \frac{1}{r} \sum_{i=0}^{r-1} R_{q_i,n} \right| \le F_r.$$

If we take $n = dr$, then it follows, just as in the discrete case, that
$R_{q_0,n} = dC_r$. To estimate $R_{q_i,n}$, $i = 1, \dots, r - 1$, we note that the channel (in
the sense of Appendix I) which defines $R_{q_i,n}$ is equivalent (in the sense of the
isomorphism defined by $T^i$) to that defined by $\omega_0(\cdot)$, the family of cylinder
sets of the form $[x_{1+i}, \dots, x_{n+i}]$, and the space of all cylinders of the form
$[z_{1+i}, \dots, z_{n+i}]$. Now if we contract this channel with respect to $x_{1+i}, \dots, x_r$
and $x_{dr+1}, \dots, x_{dr+i}$, and apply the data process which identifies $z_{1+i}, \dots, z_{n+i}$
and $z'_{1+i}, \dots, z'_{n+i}$ if $z_k = z'_k$, $k = r + 1, \dots, dr$, we see that the resulting channel
is equivalent, by virtue of the stationarity of $\omega_0(\cdot)$ with respect to $T^r$, to the
channel which defines $R_{q_0,(d-1)r}$. Thus $R_{q_i,n} \ge (d - 1)C_r$, $i = 1, \dots, r - 1$, where
$n = dr$, and we have

$$C_e \ge R_{\bar{q}} \ge \limsup_{d \to \infty} \frac{1}{dr} \left[ (d - 1)C_r - F_r \right] = \frac{C_r}{r} > C - e,$$

which completes the proof.


6. — Remarks.


Shortly after this work was completed, a paper appeared by I. P. TSAREGRADSKY
[3], in which the quantity $C$ is introduced and the relation
$C_e = C_s = C$ is proved by methods similar to those used here.

At this Summer School we were also informed by F. L. STUMPERS that
certain of our results have been obtained by J. WOLFOWITZ in a paper shortly
to be published.


APPENDIX I 


We have defined a discrete memoryless channel as consisting of the triple
$X$, $Y$, $p(\cdot \mid x)$. Now given a probability distribution $p(x, y)$ on $X \times Y$, it is
clear that we may put $p(x, y) = p(x)\,p(y \mid x)$, where $p(y \mid x)$ is, for each $x$, a
probability distribution on $Y$, and is unique for each $x$ for which
$p(x) = p(x, Y) > 0$. The nonuniqueness is of no concern since the rate
$R = H(X) + H(Y) - H(X, Y)$ is uniquely determined by $p(x, y)$; this point of view
simplifies the discussion of the notion of a contraction of a channel.

Let $p(x, y)$ be a probability distribution on $X \times Y$, and let $A_1, \ldots, A_N$ be
disjoint sets in $X$ whose union is $X$. Then $p(A_i, y)$ is well defined, and
$\mathcal{A}$, $Y$, $p(A_i, y)$, where $\mathcal{A} = \{A_1, \ldots, A_N\}$, defines a channel (or a family of
channels) with a unique input distribution $p(A_i) = p(A_i, Y)$ on $\mathcal{A}$, having rate
$R_c = H(\mathcal{A}) + H(Y) - H(\mathcal{A}, Y)$. The rate $R_c$ we call the rate of the channel
defined by $X$, $Y$, $p(x, y)$ after the contraction defined by the family $\mathcal{A}$, and
$\mathcal{A}$, $Y$, $p(A_i, y)$ we call the contracted channel. To show that the process of
contraction can never increase the rate of a channel, it suffices to observe that
$X$, $Y$, $p(x, y)$ may equally well be considered as defining a channel with input
space $Y$ and output space $X$, since the rate $R = H(X) + H(Y) - H(X, Y)$
is symmetric in $X$ and $Y$. From this point of view the contraction becomes
a data process on the output of the « reversed » channel, in which form the
non-increase of the rate is well known.
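
The statement that a contraction cannot increase the rate can be checked numerically on a small case. The following is a minimal sketch, not part of the original argument: the joint distribution, the partition and the function names (`rate`, `contract`) are illustrative assumptions, and entropies are taken in bits.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

def rate(joint):
    """R = H(X) + H(Y) - H(X, Y) for a joint distribution {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return entropy(px.values()) + entropy(py.values()) - entropy(joint.values())

def contract(joint, partition):
    """Contracted distribution p(A_i, y), where the A_i are the blocks of a partition of X."""
    block_of = {x: i for i, block in enumerate(partition) for x in block}
    out = {}
    for (x, y), p in joint.items():
        key = (block_of[x], y)
        out[key] = out.get(key, 0.0) + p
    return out

# Hypothetical joint distribution on X = {0, 1, 2}, Y = {0, 1}.
joint = {
    (0, 0): 0.30, (0, 1): 0.05,
    (1, 0): 0.05, (1, 1): 0.25,
    (2, 0): 0.10, (2, 1): 0.25,
}
partition = [{0}, {1, 2}]  # A_1 = {0}, A_2 = {1, 2}

r_full = rate(joint)
r_contracted = rate(contract(joint, partition))
print(r_full, r_contracted)
```

Swapping the roles of the two coordinates leaves `rate` unchanged, which is the symmetry the argument above relies on.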

The notation commonly used in denoting cylinder sets in a product space
is particularly convenient in this connection; if the input space of a channel
consists of the family of cylinders $[x_1, \ldots, x_n]$, then by the contraction of this
channel with respect to the component $x_1$ we mean the family of sets $[x_2, \ldots, x_n]$,
each considered as a set of cylinders $[x_1, \ldots, x_n]$ by virtue of

$[x_2, \ldots, x_n] = \bigcup_{x_1} [x_1, \ldots, x_n]$.

The same considerations hold for semi-continuous channels without memory,
except for the proof that a contraction never increases the rate. This result
can be established as follows: let $R$ be the rate of the given channel and $R_c$
its rate after a given contraction. For any $\varepsilon > 0$ there is a data process which
reduces the contracted channel to a discrete one, and yet reduces its rate to
a value $R_{cd}$ such that $R_{cd} > R_c - \varepsilon$. Now if $R_d$ is the rate of the original channel
after this data process, then $R \ge R_d$. But $R_d \ge R_{cd}$, since we are now dealing
with discrete channels. Thus $R > R_c - \varepsilon$ for any $\varepsilon > 0$, or $R \ge R_c$.



ON THE CODING THEOREM AND ITS CONVERSE FOR FINITE-MEMORY CHANNELS 575 


APPENDIX II 


Let $m$ be a non-negative integer, and let $\{a_i\}$, $i = m+1, \ldots$ be an infinite
sequence of finite terms such that $a_{i+j} \le a_i + a_j$ for all $i, j > m$. Then $\{a_i\}$
is called a subadditive sequence, and we have that $\lim a_i/i$ exists and equals
g.l.b. $a_i/i = A$.

For the proof, assume first that $A > -\infty$; then for any $\varepsilon > 0$ there is an
integer $s > m$ such that $a_s/s < A + \varepsilon$. For any $n \ge 2s$ we define an integer
$k > 0$ according to $n = ks + r$ where $s \le r < 2s$. Then $a_n \le a_{ks} + a_r \le k a_s + a_r$,
and so $a_n/n \le (ks/n)(a_s/s) + a_r/n$. Now as $n \to \infty$, $ks/n \to 1$, while $a_r/n \to 0$
because $r$ ranges over the finitely many values $s, \ldots, 2s-1$; hence we have
$\limsup a_n/n \le a_s/s < A + \varepsilon$. Since $\varepsilon$ is arbitrary, we have $\limsup a_n/n \le A$.
But $a_n/n \ge A$, which implies that $\lim a_n/n = A$. The case $A = -\infty$ follows
in similar fashion.
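
As an illustration of the lemma, the sketch below takes a hypothetical subadditive sequence, $a_n = 2n + \sqrt{n}$ (chosen here only as an example, with $m = 0$), checks the subadditivity condition on a finite range, and prints $a_n/n$ for increasing $n$; the ratios decrease towards the greatest lower bound 2.

```python
from math import sqrt

def a(n):
    """Hypothetical subadditive sequence a_n = 2n + sqrt(n)."""
    return 2 * n + sqrt(n)

def is_subadditive(seq, m, upto):
    """Check a_{i+j} <= a_i + a_j for all m < i, j < upto (finite check only)."""
    return all(
        seq(i + j) <= seq(i) + seq(j) + 1e-12
        for i in range(m + 1, upto)
        for j in range(m + 1, upto)
    )

ratios = [a(n) / n for n in (1, 10, 100, 10_000, 1_000_000)]
print(is_subadditive(a, 0, 60))
print(ratios)
```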


REFERENCES 


[1] A. FEINSTEIN: Foundations of Information Theory (New York, 1958).
[2] J. NEDOMA: The Capacity of a Discrete Channel, in Trans. First Prague Conference
on Information Theory, Statistical Decision Functions, Random Processes (1956).
[3] I. P. TSAREGRADSKY: On the Capacity of a Stationary Channel with Finite Memory
(in Russian), in Theory of Probability and its Applications, 3, 84 (1958).
[4] J. WOLFOWITZ: The Coding of Messages Subject to Chance Errors, in Illinois Journ.
of Math., 1, 591 (Dec. 1957); An Upper Bound on the Rate of Transmission of
Messages, in Illinois Journ. of Math., 2, 137 (March 1958).


SUPPLEMENTO AL VOLUME XIII, SERIE X DEL NUOVO CIMENTO
N. 2, 1959, 3° Trimestre


Correlation Indices. 
Suppose we are given $n$ stochastic variables $y_1, y_2, \ldots, y_n$, each of which
can take $g$ discrete values or states, $v_1, v_2, \ldots, v_g$. These stochastic variables
are not necessarily independent of one another, i.e., they are somehow correlated.
We are interested in introducing useful measures of such correlation
in such a way that the following conditions are satisfied. 1) The measure is
entirely independent of the values $v_i$ ($i = 1, \ldots, g$) assigned to the states, i.e.,
it can be defined even if the $v_i$ are replaced by non-numerical symbols. 2) The
measure is always non-negative. 3) The total correlation $C^{(n)}$ ($\ge 0$) is decomposed
as $C^{(n)} = \sum_{r=1}^{n} T^{(r)}$, where $T^{(r)}$ is non-negative and measures, in a certain
sense, the strength of the correlation peculiar to a subset of $r$ variables taken
out of the $n$ variables. Claim is not made that the definitions proposed in this
note are unique or the best, but it is intended that they will be meaningful
and useful according to the usage.

We want to limit our discussion to two cases: 1) simultaneous, symmetric
set of stochastic variables, 2) temporal, stationary sequence of stochastic variables.
We shall first define the first case.

We are given $n$ stochastic variables $y_1, y_2, \ldots, y_n$, each of which can take
$g$ discrete values, $1, 2, \ldots, g$. The probability that the variables $y_1, y_2, \ldots, y_n$
simultaneously take values $x_1, x_2, \ldots, x_n$, respectively, will be denoted by

(1)    $p^{(n)}(y_1 = x_1, y_2 = x_2, \ldots, y_n = x_n) \ge 0$,

with

(1')    $\sum_{x_1=1}^{g} \cdots \sum_{x_n=1}^{g} p^{(n)}(y_1 = x_1, y_2 = x_2, \ldots, y_n = x_n) = 1$.


We are speaking of probabilities here, considering an ensemble of similar
sets of $n$ variables. One can of course consider a time-ensemble of the same
set, instead of the simultaneous ensemble of similar sets, but in this case one
has to assume that there is no temporal correlation. By the symmetric set,
we mean that the probability $p^{(n)}$ of (1) is invariant for an interchange of values
of any two variables,

(2)    $p^{(n)}(y_1 = x_1, \ldots, y_i = x_i, \ldots, y_j = x_j, \ldots, y_n = x_n) =$
       $= p^{(n)}(y_1 = x_1, \ldots, y_i = x_j, \ldots, y_j = x_i, \ldots, y_n = x_n)$.
The case where p is not symmetric is discussed in a forthcoming article in 
the IBM Journal of Research and Development. 
By summing up over possible values of any $(n - r)$ variables out of the
$n$ variables, one obtains the probability distribution of $r$ variables. For instance,

(3)    $p^{(r)}(y_1 = x_1, \ldots, y_r = x_r) = \sum_{x_{r+1}=1}^{g} \cdots \sum_{x_n=1}^{g} p^{(n)}(y_1 = x_1, \ldots, y_n = x_n)$.

Because of the symmetry, the value of $p^{(r)}$ is determined only by the values
$x_1, \ldots, x_r$ and is independent of the $r$ variables which take these values. Thus
we write $p^{(n)}$ of (1) and $p^{(r)}$ of (3) as functions of values only: $p^{(n)}(x_1, \ldots, x_n)$
and $p^{(r)}(x_1, \ldots, x_r)$. The $p$'s are of course invariant for any permutation of their
arguments.

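The « information » function of $r$ variables ($r = 0, 1, \ldots, n$) is defined by

(4)    $S^{(r)} = -\sum_{x_1=1}^{g} \cdots \sum_{x_r=1}^{g} p^{(r)}(x_1, \ldots, x_r) \log p^{(r)}(x_1, \ldots, x_r)$,

where $S^{(r)}$ is a function only of $r$. Of course, $S^{(0)} = 0$.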
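
Formulas (3) and (4) translate directly into a small computation. The sketch below is only illustrative: the three-variable distribution, the function names and the choice of base-2 logarithms are assumptions, and states are labelled 0 and 1 rather than 1, ..., g.

```python
from itertools import product
from math import log2

def marginal(p_n, r):
    """Sum p^(n) over the last n - r variables, as in formula (3)."""
    out = {}
    for xs, p in p_n.items():
        out[xs[:r]] = out.get(xs[:r], 0.0) + p
    return out

def information(p_n, r):
    """S^(r) of formula (4), in bits, for the first r variables."""
    return -sum(p * log2(p) for p in marginal(p_n, r).values() if p > 0)

# Hypothetical symmetric distribution of n = 3 binary variables:
# all three variables are equal, with both common values equally likely.
p3 = {xs: 0.0 for xs in product((0, 1), repeat=3)}
p3[(0, 0, 0)] = 0.5
p3[(1, 1, 1)] = 0.5

S = [information(p3, r) for r in range(4)]
print(S)
```

For a symmetric distribution it does not matter which $r$ variables are kept; the sketch keeps the first $r$ for convenience.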

We shall next define the case of temporal, stationary sequences. We are
given an infinite, one-dimensional (temporal) series of stochastic variables:
$(\ldots, y_{-3}, y_{-2}, y_{-1}, y_0, y_1, y_2, y_3, \ldots)$, such that any arbitrary segment of $r$
consecutive variables has a unique and definite probability of having a given
ordered set of values, say $(x_1, x_2, \ldots, x_r)$. This means that

(5)    $p^{(r)}(y_1 = x_1, \ldots, y_r = x_r) = p^{(r)}(y_{1+k} = x_1, \ldots, y_{r+k} = x_r)$,

where $k$ is an arbitrary integer, positive or negative. For this reason, we shall
write the probability $p^{(r)}$ of (5) simply as $p^{(r)}(x_1, \ldots, x_r)$. What matters here is
only that $x_1, \ldots, x_r$ are the values of any $r$ consecutive variables. Permutations
of arguments are not allowed.

Formula (3) is valid also in this case with a specific proviso that $y_1, \ldots,$
$y_r, \ldots, y_n$ are consecutive variables. The summation can also be made with
respect to the first $(n - r)$ variables instead of the last $(n - r)$ variables. The
$r$-variable information function $S^{(r)}$ can also be defined by the same formula (4),
and $S^{(r)}$ is again a function of $r$ only.
In the SS case (simultaneous, symmetric), as well as in the TS case (temporal,
stationary) one can prove the following theorem:

(6)    $S^{(r)} \le S^{(s)} + S^{(t)}$

with

(6')    $r = s + t$.

From (6) follows that

(7)    $S^{(r)} \le \sum_{j=1}^{m} S^{(t_j)}$

with

(8)    $\sum_{j=1}^{m} t_j = r$.

Equality in (6) holds (in the TS case) if and only if

(9)    $p^{(r)}(x_1, \ldots, x_r) = p^{(s)}(x_1, \ldots, x_s)\, p^{(t)}(x_{s+1}, \ldots, x_r)$
       or
       $p^{(r)}(x_1, \ldots, x_r) = p^{(t)}(x_1, \ldots, x_t)\, p^{(s)}(x_{t+1}, \ldots, x_r)$,

for all values of the $x$'s. In the SS case, due to the invariance for permutation,
conditions (9) are only two representatives of $\binom{r}{s}$ similar conditions.

Among the possible decompositions of the type (7), the case $t_i = 1$
($i = 1, 2, \ldots, m = r$) gives the maximum value to the expression given on the
right side of (7). Thus

(10)    $S^{(r)} \le \sum_{i=1}^{r} S^{(1)} = r S^{(1)}$.

If and only if

(11)    $p^{(r)}(x_1, \ldots, x_r) = p^{(1)}(x_1)\, p^{(1)}(x_2) \cdots p^{(1)}(x_r)$,

i.e., if and only if the $r$ variables are completely independent, then $S^{(r)} = r S^{(1)}$.

Therefore, it is quite natural to consider

(12)    $C^{(r)} = r S^{(1)} - S^{(r)} \ge 0$

as the total correlation that exists in a configuration of $r$ variables, in both
SS and TS cases. In particular

(13)    $C^{(n)} = n S^{(1)} - S^{(n)} \ge 0$

is the total correlation existing in the entire system of $n$ variables. The problem
is to decompose $C^{(n)}$ into non-negative terms $T^{(r)}$ ($r = 2, 3, \ldots, n$) each
representing the contribution to $C^{(n)}$ from that portion of correlation which can be
attributed characteristically to an $r$-variable configuration.

Taking a set of $r$ variables, let us suppose that we observe a subset of
$(r - 1)$ variables and the last $r$-th variable separately, and we process the
observed data for these two groups separately. Under these conditions, we
shall recognize only the correlation $C^{(r-1)}$. Then, it will be natural to consider
$C^{(r)} - C^{(r-1)}$ as the correlation characteristic to the $r$-variable configuration over
and beyond the correlation existing among $(r - 1)$ variables. Thus, we shall call

(14)    $U^{(r)} = C^{(r)} - C^{(r-1)} = S^{(r-1)} + S^{(1)} - S^{(r)}$

the first correlation index of range $r$. In virtue of (6), we are guaranteed that

(15)    $U^{(r)} \ge 0$.

We can see easily that

(16)    $C^{(n)} = \sum_{r=2}^{n} U^{(r)}$,

which satisfies the conditions required of $T^{(r)}$ at the beginning.
It is important to see what $U^{(r)} = 0$ means. In the TS case this happens if

(17)    $p^{(r)}(x_1, \ldots, x_r) = p^{(r-1)}(x_1, \ldots, x_{r-1})\, p^{(1)}(x_r)$

for all values of the $x$'s, or if

(18)    $p^{(r)}(x_1, \ldots, x_r) = p^{(1)}(x_1)\, p^{(r-1)}(x_2, \ldots, x_r)$

for all values of the $x$'s, and otherwise not. (17) and (18) mean

(17')    $p(x_r \mid x_1, x_2, \ldots, x_{r-1}) = p^{(1)}(x_r)$;

(18')    $p(x_1 \mid x_2, \ldots, x_r) = p^{(1)}(x_1)$,

where the left hand sides are conditional probabilities. These equalities would
not hold even if the series were a simple Markoff chain. This can be most
readily seen by summing over $x_1, \ldots, x_{r-2}$ in (17), or over $x_3, \ldots, x_r$ in (18), which
shows that if $U^{(r)} = 0$, $r > 2$, then $U^{(2)} = 0$ or $p^{(2)}(x_1, x_2) = p^{(1)}(x_1)\, p^{(1)}(x_2)$. That
means that even in a simple Markoff chain, one will get $U^{(r)} \ne 0$ for $r > 2$.
This is not a desirable situation, if one wants to interpret $U^{(r)}$ as the correlation
over and above $(r - 1)$ variable correlation in the TS case.


In the SS case, (17) implies the $r$ conditions which can be obtained
from (17) by permutations of variables. In this case too, $U^{(r)} = 0$ implies
$U^{(r-1)} = U^{(r-2)} = \ldots = U^{(2)} = 0$. However, one cannot condemn (16) simply for
this reason. Indeed in this case, there is no concept of distance between
variables, therefore the presence of two-variable correlation may very well imply
automatically the presence of higher-range correlation. In general, a non-vanishing
$C^{(n)}$ implies that not all $U$'s can vanish. This in turn means that
there is a certain value of $r$, beyond which all $U^{(r)}$ are finite (i.e., non-vanishing).

The $U^{(r)}$ introduced above is the first difference of $C^{(r)}$. The second difference
of $C^{(r)}$ has also an important meaning:

(19)    $W^{(r)} = U^{(r)} - U^{(r-1)} = C^{(r)} - 2C^{(r-1)} + C^{(r-2)} = -S^{(r)} + 2S^{(r-1)} - S^{(r-2)}$,

which will be called second correlation index of range $r$.
It is easy to show that

(20)    $W^{(r)} \ge 0$.


In the TS case the equality in (20) holds if and only if

(21)    $p^{(r)}(x_1, \ldots, x_r) = \dfrac{p^{(r-1)}(x_1, \ldots, x_{r-1})\, p^{(r-1)}(x_2, \ldots, x_r)}{p^{(r-2)}(x_2, \ldots, x_{r-1})}$

for all values of the $x$'s. Condition (21) can also be rewritten as

(22)    $p(x_r \mid x_1, \ldots, x_{r-1}) = p(x_r \mid x_2, \ldots, x_{r-1})$.

This means that the conditional probability of $y_r = x_r$ is not influenced by $y_1$.
This can be adequately interpreted as absence of correlation of range $r$ over
and above the correlations that may exist within a sequence of length $(r - 1)$.
Thus the $W$'s are useful quantities for the TS case.
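
The contrast between the two indices in the TS case can be seen on a simple Markoff chain. The sketch below is an illustration under stated assumptions: the two-state transition matrix is hypothetical, logarithms are base 2, and $U^{(r)}$, $W^{(r)}$ are computed from the information functions as $S^{(1)} + S^{(r-1)} - S^{(r)}$ and $2S^{(r-1)} - S^{(r)} - S^{(r-2)}$. For such a chain $U^{(r)}$ stays positive for every $r \ge 2$, whereas $W^{(r)}$ vanishes for $r > 2$, and both expansions of $C^{(n)}$ agree.

```python
from itertools import product
from math import log2

# Hypothetical two-state stationary Markov chain (illustrative numbers).
P = [[0.9, 0.1], [0.2, 0.8]]   # P[a][b] = probability that b follows a
pi = [2 / 3, 1 / 3]            # stationary distribution of P

def S(r):
    """Information function S^(r), in bits, of r consecutive variables."""
    total = 0.0
    for xs in product((0, 1), repeat=r):
        p = pi[xs[0]] if r else 1.0
        for a, b in zip(xs, xs[1:]):
            p *= P[a][b]
        if p > 0:
            total -= p * log2(p)
    return total

def U(r):
    """First correlation index of range r."""
    return S(1) + S(r - 1) - S(r)

def W(r):
    """Second correlation index of range r."""
    return 2 * S(r - 1) - S(r) - S(r - 2)

n = 4
C_n = n * S(1) - S(n)
sum_U = sum(U(r) for r in range(2, n + 1))
sum_W = sum((n - r + 1) * W(r) for r in range(2, n + 1))

print([round(U(r), 6) for r in range(2, n + 1)])
print([round(W(r), 6) for r in range(2, n + 1)])
print(round(C_n, 6), round(sum_U, 6), round(sum_W, 6))
```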

In the SS case, $W^{(r)} = 0$ is equivalent to (21) and to all the equations
obtained from (21) by permutation of variables.

It is easy to see that we can use $W^{(r)}$ as $T^{(r)}$:

(23)    $C^{(n)} = \sum_{r=2}^{n} (n - r + 1)\, W^{(r)}$,

with $S^{(0)} = 0$; this is also a satisfactory development of $C^{(n)}$. See S. WATANABE:
IRE Profess. Group Inform. Theor. Symposium 1954, p. 85.

It is conceivable to use still higher order differences of $C^{(r)}$ for the expansion
of $C^{(n)}$. A particularly interesting form is

(24)    $C^{(n)} = \sum_{r=2}^{n} \binom{n}{r} F^{(r)}$,

where $F^{(r)}$ is the $r$-th difference of $C$, i.e.,

(25)    $F^{(r)} = -\left[ \binom{r}{0} S^{(r)} - \binom{r}{1} S^{(r-1)} + \binom{r}{2} S^{(r-2)} - \ldots + (-1)^{r} \binom{r}{r} S^{(0)} \right]$.

However, in the first place, $F^{(r)}$ can be both positive and negative. In the
second place, the condition $F^{(r)} = 0$ (which should mean the absence of $r$-variable
correlation) cannot be clearly interpreted in terms of probabilities. In
the SS case with $g = 2$, $F^{(3)}$ calculated for $p(111) = \frac{1}{2}$, $p(000) = \frac{1}{2}$ becomes $-1$.
Thus, expansion (24) is not satisfactory.
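
The value $F^{(3)} = -1$ quoted above can be reproduced in a few lines. This is a sketch under two assumptions: logarithms are taken to base 2, so that $S^{(1)} = S^{(2)} = S^{(3)} = 1$ for the distribution $p(111) = p(000) = \frac{1}{2}$, and $F^{(r)}$ is evaluated as the $r$-th difference of $C$ starting from $C^{(0)}$.

```python
from math import comb

# S^(r) in bits for g = 2 and p(111) = p(000) = 1/2:
# every non-empty subset of the variables carries exactly one bit.
S = [0.0, 1.0, 1.0, 1.0]

def C(r):
    """Total correlation C^(r) = r S^(1) - S^(r)."""
    return r * S[1] - S[r]

def F(r):
    """r-th difference of C, taken from C^(0)."""
    return sum((-1) ** k * comb(r, k) * C(r - k) for k in range(r + 1))

n = 3
expansion = sum(comb(n, r) * F(r) for r in range(2, n + 1))
print(F(2), F(3), expansion, C(n))
```

The expansion still adds up to $C^{(3)}$, but only because a negative third-order term is compensated by the second-order terms.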

Finally, it should be noted that the well-known expression of $p^{(n)}(x_1, x_2, \ldots, x_n)$
in terms of conditional probabilities:

(26)    $p^{(n)}(x_1, x_2, \ldots, x_n) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_n \mid x_1, \ldots, x_{n-1})$

leads to

(27)    $S^{(n)} = S^{(1)} + (S^{(2)} - S^{(1)}) + \ldots + (S^{(n)} - S^{(n-1)})$,

where each term $(S^{(r)} - S^{(r-1)})$ has no clear meaning as far as correlation is
concerned.


In conclusion, it can be said that in the TS case the expansion (23) in terms
of $W^{(r)}$ is definitely the most appropriate. In the SS case, the expansion (16)
in terms of $U^{(r)}$ has a clear meaning in the following operational conditions.
When $n$ variables are given, the observer starts with examining the information
content of only one variable, and then of two variables, and then of three
variables, etc., at each stage adding one more variable. When he passes
from $r$ variables to $(r+1)$ variables, if the last one were completely independent
from the other $r$ variables then the information content would be $S^{(r)} + S^{(1)}$.
But when he observes all the $(r+1)$ variables together, he discovers that the
information content is only $S^{(r+1)}$. The difference $S^{(r)} + S^{(1)} - S^{(r+1)}$ he would
call the additional correlation $U^{(r+1)}$ peculiar to the $(r+1)$ variables.

In a similar manner, the quantity $(S^{(s)} + S^{(t)} - S^{(s+t)})$ has the following clear
meaning. The observer first observes a group of $s$ variables and a group of
$t$ variables separately. The total information is $S^{(s)} + S^{(t)}$. Next he observes
the $(s+t)$ variables together. Then he gets information $S^{(s+t)}$. The decrease
in information $(S^{(s)} + S^{(t)} - S^{(s+t)})$ is to be considered as the correlation existing
between the group of $s$ variables and the group of $t$ variables.

The domains of application of the present results are numerous, ranging
from physics to psychology, sociology, etc., just to name a few. In physics,
an $n$-fermion system has to obey the Pauli exclusion principle. This principle
is a correlation in the present sense, since if a particle occupies a certain quantum
state, then other particles are prevented from occupying this quantum state.
The two-body correlation $C^{(2)}$ due to this effect can be estimated to be

log [2n/(n − 1)].

Dr. NANCY ANDERSON and Mr. JOHN ROSS of IBM Research Laboratory
are investigating the relationship between the correlation in the sense of this
report and the correlation coefficients in the usual sense, having in mind
applications to psychology.
On the Detection of Gaussian Signals in Gaussian Noise.


D. SLEPIAN 


Bell Telephone Laboratory - Murray Hill, N.J. 


1. — Introduction. 


The problem of detecting a Gaussian signal in Gaussian noise has been
discussed by a number of authors during the past decade. A recent paper
by MIDDLETON [1] on this subject contains references to much of the earlier
work. Here we comment on several aspects of this problem generally overlooked
in the past.

The problem treated can be stated as follows. An observer has available
to him a sample of finite duration, $x(t)$, $0 \le t \le T$, of a stationary Gaussian
process. It is known to him that $x(t)$ is either a sample from the Gaussian
ensemble A with mean zero and power density spectrum $\varphi_A(f)$ or a sample
from the Gaussian ensemble B with mean zero and power density spectrum
$\varphi_B(f)$. The observer is to decide whether $x(t)$, $0 \le t \le T$, came from A or B.
In most engineering applications, A is interpreted as signal plus noise and B
as noise alone.

The observer's decision as to which ensemble $x(t)$ came from can be in
error in two different ways: he can assert that $x(t)$ came from A when indeed
it came from B; or he can assert that $x(t)$ came from B when indeed it came
from A. We denote the probabilities of these two types of error by $p_1$ and
$p_2$ respectively.

The main result that has been obtained by the author is that for the 
spectra generally considered in engineering problems, it is possible for the 
observer to make his decision with vanishingly small probability of error of 
either kind. 

Here we report on one part of our findings. The result described below
can be generalized and extended. For full details the reader is referred to
the complete paper to appear in IRE Trans. Profess. Group Inform. Theor.,
June 1958.




2. — Properties of a certain quadratic form.


From the observed sample x(t), 0<t< 7 form the quadratic form: 


we (jeter) 


j=0 \h=0 


where $n = mq$, $m$ and $q$ are positive integers. This form uses only the values
$x(jT/n)$, $j = 0, 1, \ldots, n$. We first investigate some properties of $y_n$ when it is
assumed that $x(t)$ is a stationary Gaussian process with mean zero and covariance
function $r(\tau) = E\,x(t)\,x(t + \tau)$. We shall further assume that the spectrum
of the process


ne fe [2mife]r(x)dr, 


has the asymptotic behaviour for large f 


a 
PO) 


where s is a positive integer. 
The expected value of y, is 


RAPT ESTONE Ee BD 
Mr SE E (sr 


On introducing the Fourier integral representation for r, this can be written 


nr | en >> SEG 7) HD exp [2 ri — k)f (Tn)] = 


id 228 pan —1 dc ni Le 22m 72m-1 7 né 2m né sin È 2m 
D oa ER fo) (OT 


- © co 


It then follows easily from the asymptotic behavior of $\varphi$ that if $s > m$,


lim Fy, = 0, 
while, if s = m, di 

lim BY nm = ac, , 
where da 


10 
The variance of $y_n$ can also be computed in a straightforward manner. One finds
nni fm) (m\ (m\ (m 
SS t— a kal + pi 
vene) ZAR 
u,v=0 


Gem +prima tin} + 


+r {(G@—o)m +) ah? {to c)m+1= y] ||. 


n 


Introduce the Fourier integral representation of r, interchange summation and 
integration and perform the sums. There results: 


Vary —=2"-(5) fufarenen 


E amalî—f') Tal 


9 Sp! , e bi i 
mani sin af 7 sin af Al o 


or, after an appropriate change of variables, 


gam-1772m-2 T2 Pr LIA LA ROT a i 
dan sar se | NI ne (er) vx) (nà) om 


È e £ sin 7" È mq(é — nl 
SARAI sin m(§ — n) 


We now show that lim Vary,, — 0. Since all factors in the integrand of (2) 


q—>0 


are non-negative and since f°”g(f) is bounded from above by our assumption, 
it follows that 


as 
(3) Var Yma < q ha ’ 
where d does not depend on n or q and 


C1 fy ef 4, [sin (Elm) ae 
le: mur 


— œ — © 


sin q(§ — ap 
sin (€ — 7) |° 


Now 


(4) sms Di 


Tne (ce) Gra 


G+Dr (k+1)x 


cme (€;/m) sin ( nim)" 


sin g(&; — al E: 
E;]m xlm 


sin (€; — yx) 


sin gæ|? 
sin x 


gl) ? 


sf L ic sin ( ) 


a ANALI Me) Mn) = = | de 
where
> SE 
n= À Pao] 


and 


22-2 


ato) — fam Ey +29) 


x 


= (y— 2) 


The square bracket in the last integral of (4) is the Fejér kernel studied in
Fourier theory [2]. As $q \to \infty$, $h_q \to 2\pi g(0+)$, which is finite. From (3) it
follows that $\operatorname{Var} y_{mq} \to 0$ as $q \to \infty$.


3. — The detection problem. 


The preceding paragraphs show that if $\varphi \sim a/f^{2s}$, then $y_n$ of (1) converges
in probability to zero if $s > m$, and to $a\,c_m$ if $s = m$. Consider now the
problem of distinguishing between the stationary Gaussian ensembles A
and B where $\varphi_A \sim a/f^{2m}$ and $\varphi_B \sim b/f^{2m+2p}$, $p \ge 0$. Form the test function (1)
from the observed samples $x(jT/n)$, $j = 0, 1, \ldots, n$, $n = mq$. If $p > 0$, choose
a threshold $\gamma$ between zero and $c_m a$. Use the decision rule « sample came
from A if $y_{mq} > \gamma$; sample came from B if $y_{mq} < \gamma$ ». If $p = 0$ and $a > b$,
choose a threshold $\gamma$ between $c_m a$ and $c_m b$. Use the decision rule « sample
came from A if $y_{mq} > \gamma$; sample came from B if $y_{mq} < \gamma$ ». If $p = 0$ and $a < b$,
reverse the decision rule. By choosing $q$, and hence $n$, sufficiently large, $p_1$
and $p_2$ can be made arbitrarily small.
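
The mechanism behind this decision rule can be illustrated in the simplest situation, where both spectra fall off like $1/f^2$ with different constants. The sketch below is not the quadratic form (1) itself, whose normalization is not reproduced here. As a stand-in it uses the plain sum of squared first differences of the samples, and as hypothetical ensembles two stationary Ornstein-Uhlenbeck processes, for which this sum tends to `sigma2 * T` as the sampling becomes dense. All names and parameter values are assumptions chosen for illustration.

```python
import math
import random

def ou_samples(sigma2, theta, T, n, rng):
    """Exact samples x(jT/n), j = 0..n, of a stationary Ornstein-Uhlenbeck process.

    Its covariance is (sigma2 / (2 theta)) exp(-theta |tau|), so its spectrum
    falls off like a constant times sigma2 / f**2 for large f.
    """
    dt = T / n
    decay = math.exp(-theta * dt)
    stationary_var = sigma2 / (2 * theta)
    step_sd = math.sqrt(stationary_var * (1 - decay * decay))
    x = rng.gauss(0.0, math.sqrt(stationary_var))
    xs = [x]
    for _ in range(n):
        x = decay * x + step_sd * rng.gauss(0.0, 1.0)
        xs.append(x)
    return xs

def squared_increments(xs):
    """Sum of squared first differences of the samples (stand-in test function)."""
    return sum((b - a) ** 2 for a, b in zip(xs, xs[1:]))

def decide(xs, threshold):
    """Decision rule: 'A' if the test function exceeds the threshold, else 'B'."""
    return "A" if squared_increments(xs) > threshold else "B"

rng = random.Random(0)
T, n = 1.0, 20000
sample_a = ou_samples(2.0, 1.0, T, n, rng)   # hypothetical ensemble A
sample_b = ou_samples(1.0, 1.0, T, n, rng)   # hypothetical ensemble B
threshold = 1.5 * T                          # between the two limiting values
print(decide(sample_a, threshold), decide(sample_b, threshold))
```

With a finer sampling grid the test function concentrates more tightly around its limit, which is why both error probabilities can be driven down on a record of fixed duration.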

It is to be noted that the foregoing results say nothing about the case in which
$\varphi_A$ and $\varphi_B$ have identical asymptotic behavior. It is not known in this case
whether or not sequences of tests based on $x(t)$, $0 \le t \le T$, can be constructed
which make $p_1$ and $p_2$ arbitrarily small. This case is often met in
engineering applications in which A is interpreted as signal plus noise, B as
noise alone and it is assumed that the signal spectrum falls off more rapidly
with increasing $f$ than the noise spectrum.

Whether or not perfect detection is possible in this case, the results presented
show this model to be a poor one for the engineering problem. For,
one can always suppose added to the signal spectrum an arbitrarily small
amount of Gaussian signal power at extremely high frequencies so that indeed
the asymptotic behaviours of $\varphi_A$ and $\varphi_B$ are different and perfect detection is
possible. The addition of this arbitrarily small amount of signal power at
arbitrarily high frequencies is certainly non-physical. Thus by altering
non-physically significant parameters in the detection model, we can be assured
of perfect detection.
The most obvious shortcoming of this model is the assumption that the
observer has perfect knowledge of $\varphi_A$ and $\varphi_B$ beforehand. The author feels
that proper inclusion of the observer's uncertainty about $\varphi_A$ and $\varphi_B$ will preclude
the possibility of perfect detection on the basis of an observation of a sample
of finite duration, and will result in a model that more accurately describes
the situation actually encountered in engineering application. He looks forward
to the development of such a model.
On the Realizability Problem for Irredundant Boolean Networks.


L. LOFGREN 


Swedish Research Institute of National Defence - Stockholm 


1. — Introduction. 


We will give a closed theory of a certain non trivial class of Boolean net- 
works, the irredundant networks. 

Two criteria will be derived by which we can decide if a Boolean function 
has an irredundant network or not. Each is necessary and both together are 
sufficient. The two criteria are called the c-criterion and the subrearrangement 
I’, -criterion. 

It is possible to refer the general minimization problem for Boolean networks 
to a series of existence problems of the kind treated here. We will restrict 
the following treatment only to irredundant networks. 

The particular reason for this study of irredundant networks was a more 
general study of networks with a minimum redundancy-number of branches 
for a certain protection against branch errors. 

By a Boolean function we mean a disjunction of clauses. A clause is a
conjunction of literals (where no letter appears twice). A literal is a letter
either affirmed or negated. Only the values 0 or 1 can be assigned to a letter.
Each Boolean function can be transformed into a standard form, for which
we choose an indispensable prime implicant form (for prime implicants, see
QUINE [3]).

We will always refer an irredundant network to a Boolean function of
such a standard form. We say by definition that a network is irredundant
if there is a 1:1 correspondence between its branches and the different literals
of the « corresponding » Boolean function of standard form (the literals
of a standard form are irredundant). After having derived the c-criterion,
we will give a more precise meaning of the word « corresponding » (Sect. 3).

In giving the sub-rearrangement criterion of Sect. 4, we also give the solution
to the following topological problem: how can a matrix of incidence
(with elements reduced mod 2) which corresponds to a linear graph be
characterized? (Or: which matrices can be reduced so as to contain only two 1's
in each column?) This problem has been put forth by OKADA [2] and by
SESHU [4], but to the author's knowledge no solution has been published so far.

In the following we will be concerned only with 2-terminal networks.
The extension of the treatment to n-terminal networks is quite straightforward
(and will be published elsewhere).


2. — The Veblen incidence matrix H and the loop matrix L. 


Let us consider a graph of a Boolean network with $c$ branches and $r$ vertices.
It is uniquely represented (except for row permutations) by the incidence
matrix

(1)    $H = \lVert \eta_{ij} \rVert$,

with $c$ columns corresponding to the branches (a, b, c, d, ...) and $r$ rows
corresponding to the vertices (1, 2, 3, ...) so that $\eta_{ij}$ is 1 if the $i$-th vertex is
incident with the $j$-th branch, and $\eta_{ij} = 0$ otherwise. Thus the $i$-th row of $H$
is the symbol for the set of branches which are incident with the vertex $i$.
The $j$-th column is the symbol for the point pair incident with the branch $j$.
Since each branch is incident with exactly two points, every column must
contain two 1's. Conversely, any matrix whose elements are 0's and 1's and
which is such that each column contains exactly two 1's and each row at least
one 1, can be regarded as the incidence matrix of a branch-vertex graph (not
necessarily connected, however).

Let us denote a path between two terminal vertices with the row matrix

(2)    $P = \lVert 0\ 1\ 1\ 0\ \ldots\ 1 \rVert$

with 1's in those columns which correspond to branches of the path and with
0's in the other columns. The matrix product $H \cdot \tilde{P}$ (with addition mod 2
and ordinary multiplication), where $\tilde{P}$ is the transpose of $P$, will then be a
column matrix with only two 1's, one for each of the two terminal vertices,
and with 0's in all other rows. This is because the $i$-th row of $H$ (the symbol
for the set of branches of the graph incident with point $i$) will produce a 1
in the product if and only if there is an odd number of branches contained
in $P$ which are incident with the point $i$. This is a general criterion of an end
point in a path (which also may contain loops).

Let us then with $P$ represent all paths between two terminal vertices of
a graph (one row for each path). The product $H \cdot \tilde{P}$ will then have two strings
of 1's in the rows which correspond to the terminal vertices (and 0's elsewhere).
By adjoining to $H$ and to $P$ an extra terminal branch $T$ between the two
terminal vertices, the path matrix becomes a loop matrix $L$, and $H$ will be
denoted $H_2$ (the index 2 refers to the fact that each column has exactly two 1's).
The matrix product

(3)    $H_2 \cdot \tilde{L} = 0$

is now a matrix in which every element is a 0.
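
The two products described above are easy to verify on a small graph. The following sketch uses a hypothetical bridge-type graph with vertices 1 to 4, branches a = (1,2), b = (2,3), c = (3,4), d = (1,3), e = (2,4), and terminal vertices 1 and 4; the graph and all names are illustrative assumptions, not taken from the paper.

```python
def matmul_mod2(A, B):
    """Matrix product with ordinary multiplication and addition mod 2."""
    return [[sum(x * y for x, y in zip(row, col)) % 2 for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

# Incidence matrix H: rows are vertices 1..4, columns are branches a, b, c, d, e.
H = [
    [1, 0, 0, 1, 0],  # vertex 1: a, d
    [1, 1, 0, 0, 1],  # vertex 2: a, b, e
    [0, 1, 1, 1, 0],  # vertex 3: b, c, d
    [0, 0, 1, 0, 1],  # vertex 4: c, e
]

# Path matrix P: one row per path between the terminal vertices 1 and 4.
P = [
    [1, 0, 0, 0, 1],  # a, e
    [0, 0, 1, 1, 0],  # d, c
    [1, 1, 1, 0, 0],  # a, b, c
    [0, 1, 0, 1, 1],  # d, b, e
]

end_points = matmul_mod2(H, transpose(P))

# Adjoin the terminal branch T between vertices 1 and 4.
T_column = [1, 0, 0, 1]
H2 = [row + [t] for row, t in zip(H, T_column)]
L = [row + [1] for row in P]

loops = matmul_mod2(H2, transpose(L))
print(end_points)
print(loops)
```

Each column of `end_points` marks the two terminal vertices of one path, and after the branch $T$ is adjoined every entry of `loops` is 0, as equation (3) requires.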
Let us now consider a Boolean function of a standard form, for instance

$B = abc \cup ad' \cup \ldots$,

and let a branch, say b', corresponding to the literal b', have the value 1 if
the branch (switching element) is transmitting (« make ») and the value 0 if
the branch is non-transmitting (« break »). Thus « make » and « break » between
the end points of two series-connected branches a' and b' is represented
by $a' \cdot b'$, and for a parallel connection the value $a' \cup d'$ is obtained. (Compare
SHANNON [5, 6].) So a Boolean function uniquely specifies a loop matrix
$L$, in which the rows correspond to the clauses of $B$ with a 1 in the
T-column, and 1's in all those columns which as literals constitute the clause
(and 0's in the other columns).

So in order to investigate the existence of an irredundant network for $B$
we have to solve equation (3) for $H_2$, when $L$ is known.

Let us consider the complete solution $G_H$ of (3). It is evidently a group
under addition mod 2, for if $\alpha$ and $\beta$ are two solution elements, so is $\alpha + \beta$ mod 2.
Further the identity element of $G_H$ is a string of 0's (which obviously also is
a solution of (3)). The inverse of an element of $G_H$ evidently is the element
itself. Finally since the operation of addition mod 2 (digit by digit) is commutative,
$G_H$ is an abelian group.

If we instead consider $H_2$ of (3) known and solve for $L$, the complete
solution $G_L$ is also an abelian group under addition mod 2.


3. — First kind of realizability criterion (the c-criterion). 


When determining $G_H$ from (3) with $L$ specified, we must require that the
part of $G_H$ which actually forms the $H_2$-solution must be such that we from
this part (and (3)) can determine an $L'$ which is in acceptable agreement with
the specified $L$. We mean with an acceptable agreement between $L'$ and $L$
that the corresponding sets $B'$ and $B$ satisfy


(5) BAR EB. 


$B_r$ is a complete redundant form of $B$, i.e. $B_r$ contains every element that
implies $B$ (beside the $B$-elements, also elements which subsume $B$-elements
and other dispensable clauses) and vanishing elements containing a letter both
affirmed and negated.

Let us denote with $\Gamma_L$ a generator set of $G_L$. Let us further transform
$\Gamma_L$ under element addition (mod 2) so as to contain a single element $\gamma$ with
a 1 in the T-position and a rest, the set $\Gamma$. Each element of $\Gamma$ has a 0 in the
T-position. This division of $\Gamma_L$ into $\gamma$ and $\Gamma$,


(6) Re wen 


is always possible since $\Gamma_L$ is a generator set. In order to obtain $L'$ from $G_L$
we divide $G_L$ into a subgroup $g$ generated by $\Gamma$ and a coset $c$:

(7)    $c = \gamma + g$ (mod 2).

Evidently all elements of the coset $c$ have a 1 in the T-position and thus


(8) Lo, 
Evidently $c$ covers $L$, and from (5) we obtain the following necessary criterion:

c-criterion: A necessary condition for a Boolean function $B$ of standard
form (specifying $L$) to have an irredundant network is that the coset $c$ (7)
of the complete group generated by $L$ is contained in the complete redundant
form $B_r$ of $B$:

(9)    $B_r \supseteq c = \gamma + g$ (mod 2).

(It should be observed that $c$ contains multiple-loops through $T$. These
are not dismissed from the criterion because a corresponding clause in $B_r$
always subsumes a clause which corresponds to a single loop through $T$.)

Evidently any complete independent set of $L$ can be chosen for the generator
set $\Gamma_L$ (6). The coset is still the same.

We mean with networks « corresponding » to a Boolean function $B$, all
those networks ($H_2$) which have the same $c$ (determined from $B$ according
to the above) and which can be determined from (3). In the next section
we will see how to decide if any such networks exist, and if so, how to
determine all of them.


4. — Second type of realizability criterion (the subrearrangement criterion). 


Evidently a solution $H_2$ of (3) must contain so many elements of $G_H$ so as
to determine $G_L$ from the same equation (3). This implies that $H_2$ must contain
at least a generator set $\Gamma_H$ of $G_H$. Furthermore we must require that
each column of $H_2$ contains precisely two 1's. This implies a relation between
the elements of $H_2$. Since all (say $r - 1$) elements of a generator set are
independent, $H_2$ must at least contain $r$ elements, the $r$-th being the sum
(mod 2) of the $r - 1$ generator elements.


Lemma 1. If an irredundant H5-solution of (3) (with L, specified by 77) 
exists, it must contain one of the generator sets J’, (each having r—1 ele- 
ments) of G, and one further element (the r-th element) being the sum mod 2 
of the elements of /7,. 

The proof will run as follows. Suppose that $H$ consists of a generator set
(with elements $\gamma_1, \gamma_2, \ldots, \gamma_{r-1}$) and of the $k+1$ depending elements
$\gamma_r, \gamma_{r+1}, \ldots, \gamma_{r+k}$. The corresponding $k+1$ relations may be written:


(10) $\sum\nolimits_j \gamma_i = 0$ (mod 2), $j = 0, 1, \ldots, k$,


where the set of elements covered by $\Sigma_j$ is denoted $S_j$.

It is possible to show that we can always write the relations (10) in such
a form that all the sets $S_j$ are disjoint (and so that their union covers all the
$r+k$ elements). (Compare also VEBLEN [7]). Recalling that an element of $H_2$
represents the branches which are incident with the corresponding vertex,
this means that each branch which has one vertex in the set $S_j$ must also have
the other vertex in $S_j$. Since the sets $S_j$ are disjoint, the graph must consist
of $k+1$ disconnected subgraphs. But all the branches of a disconnected
subgraph which does not contain the T-element must be redundant. For the
clauses of $B$, i.e. the paths through T, are only determined by the branches of
the subgraph which contains the T-element. Hence an $H_2$-solution (an
irredundant network) cannot contain disconnected subgraphs and $k$ must be zero.
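
The connectivity argument of the proof can be checked mechanically. The following is a minimal sketch, assuming the incidence matrix is given as a list of 0/1 rows (one row per vertex, one column per branch); the helper name `count_components` is hypothetical.

```python
def count_components(incidence):
    """Count the connected pieces of the graph described by an incidence matrix
    whose columns each contain exactly two 1's (the two end vertices of a branch)."""
    n_vertices = len(incidence)
    parent = list(range(n_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for j in range(len(incidence[0])):
        ends = [v for v in range(n_vertices) if incidence[v][j]]
        if len(ends) != 2:
            raise ValueError("column %d does not contain exactly two 1's" % j)
        parent[find(ends[0])] = find(ends[1])
    return len({find(v) for v in range(n_vertices)})


# A triangle (vertices 0, 1, 2) and a separate pair of parallel branches (vertices 3, 4).
matrix = [
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
print(count_components(matrix))  # 2
```

In the notation of the proof this example has $k = 1$: the rows split into two disjoint zero-sums, one for each disconnected piece.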

The question is thus referred to: which $\Gamma_r$-sets of $G_r$ are $\Gamma_{r2}$-sets, i.e. contain
at most two 1's in each column? (By adjoining an $r$-th element according
to lemma 1, each column will then have precisely two 1's.)
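
As a small illustration of the remark in parentheses, the sketch below adjoins the mod-2 sum of the rows to a matrix that has at most two 1's per column. The list-of-rows representation and the name `adjoin_sum_row` are assumptions made for the example.

```python
def adjoin_sum_row(gamma):
    """Append the mod-2 sum of all rows to a 0/1 matrix with at most two 1's per column."""
    n_cols = len(gamma[0])
    weights = [sum(row[j] for row in gamma) for j in range(n_cols)]
    if any(w > 2 for w in weights):
        raise ValueError("some column has more than two 1's")
    # A column with a single 1 receives a second 1; a column with two 1's receives a 0.
    return gamma + [[w % 2 for w in weights]]


gamma = [
    [1, 1, 0],
    [0, 1, 1],
]
full = adjoin_sum_row(gamma)
print(full[-1])                                       # [1, 0, 1]
print([sum(row[j] for row in full) for j in range(3)])  # [2, 2, 2]
```

A column that is empty in the generator set stays empty after the adjunction, which is the situation the criterion below treats separately.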
There are


(11) $N(r) = \dfrac{(2^{r-1} - 2^0)(2^{r-1} - 2^1)(2^{r-1} - 2^2) \cdots (2^{r-1} - 2^{r-2})}{(r-1)!}$


different generator sets corresponding to a group generated by $r-1$
elements. For example $N(7) = 27\,998\,208$.
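
Formula (11) is easy to evaluate directly. The sketch below is a plain transcription of it; the function name is hypothetical.

```python
from math import factorial


def count_generator_sets(r):
    """Evaluate (11): the number of different (unordered) generator sets of a
    group generated by r - 1 independent elements under addition mod 2."""
    n = r - 1
    ordered = 1
    for i in range(n):
        ordered *= 2 ** n - 2 ** i  # choices for the (i+1)-th independent element
    return ordered // factorial(n)   # the order of the elements is irrelevant


print(count_generator_sets(4))  # 28
print(count_generator_sets(7))  # 27998208
```

The rapid growth of $N(r)$ is the reason why an exhaustive search through all generator sets is not a practical test.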

It is useful to observe that any generator set of $G_r$ can be obtained from
a specific generator set $\Gamma_r$ only by repeated operations of the type: replace
one row by the sum (mod 2) of this row and some other row (a row-replacement
operation).
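
A row-replacement operation can be sketched as follows, again with rows stored as integer bit masks (an assumption of this example, as are the function names). The check at the end illustrates that the generated group is unchanged.

```python
def row_replace(rows, target, other):
    """Replace rows[target] by the mod-2 sum of rows[target] and rows[other]."""
    if target == other:
        raise ValueError("a row cannot be added to itself")
    replaced = list(rows)
    replaced[target] = rows[target] ^ rows[other]
    return replaced


def span_gf2(rows):
    """The group generated by the rows under addition mod 2."""
    group = {0}
    for row in rows:
        group |= {x ^ row for x in group}
    return group


rows = [0b1100, 0b0110, 0b0011]
new_rows = row_replace(rows, 0, 1)
print([format(x, "04b") for x in new_rows])   # ['1010', '0110', '0011']
print(span_gf2(rows) == span_gf2(new_rows))   # True
```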

Generally it is quite easy to convert a $\Gamma_r$-matrix into $\Gamma_{r2}$-form by
row-replacement operations, that is, if a $\Gamma_{r2}$-form exists. A strategy in the beginning
of the operations is to add together two rows which have many 1's in the same
columns, and to replace one of them by the sum, so that the number of 1's
increases in as few columns as possible.

However, if no $\Gamma_{r2}$-form exists, we must know with certainty that none of the
$N(r)$, see (11), $\Gamma_r$-sets is of $\Gamma_{r2}$-form. Such information is conveniently obtained
from the subrearrangement criterion to be derived in what follows. Also if
a $\Gamma_{r2}$-form exists but has not been found by the row-replacement procedure,
it is obtained from the subrearrangement criterion, which in addition gives
all $H_2$-forms that correspond to the coset $c$, if more than one exists.

The idea behind the subrearrangement criterion is to derive a subproblem 
whose solutions will give character to the main realizability problem. We will 
prove that it is possible to give the subproblem essentially the same nature 
as that of the main problem, but with immediately obtained solutions, which 
restrict the number of investigations for the main problem to the many- 
valuedness of the subproblem. 

An H-solution must be a non-separable graph, but for the subproblem
we must admit solutions, $H_2$, which may consist of separable loop-graphs.
We will say that a subgraph is tied to another subgraph if it is connected to 
it at three or more vertices. If it is connected at only two vertices it is loosely 
tied, and if it is connected at one or no vertex, the graph is separable. A 
single branch with its two vertices will not be considered as a subgraph. 

For the criterion we need the following lemma. 


Lemma 2. A matrix of incidence $H_2$ (of $r$ rows) of a loop-graph (a graph
where each branch is incident with a loop) is defined by its $\Gamma$-matrix which
consists of $r-1$ arbitrarily chosen rows of $H_2$ (the $r$-th row is the sum of
the rows of $\Gamma$). The elements of $\Gamma$ need not be independent. All possible proper
row-replacements on $\Gamma$ which maintain the $H_2$-form (two 1's in each column)
are repeated applications of the following types. (A pure row-reordering
row-replacement operation is considered improper, because the arrangement of
the elements in an incidence matrix is without significance.)


i) If $H_2$ contains loosely tied subgraphs, say $H_k$, which has only the two
vertices $r_i$ and $r_j$ in common with another subgraph, then the row-replacement


$r_i \to r_i + \sum_{H_k\ \mathrm{exc}\ r_i, r_j} (r_\nu)$,

$r_j \to r_j + \sum_{H_k\ \mathrm{exc}\ r_i, r_j} (r_\nu)$


($r_i$ is replaced by the sum mod 2 of itself and all the vertices of $H_k$ except $r_i$
and $r_j$, and similarly for $r_j$) is possible for any loosely tied subgraph of $H_2$.
This operation corresponds geometrically to a change of connectivity of $H_k$.


ii) If $\Gamma$ contains (non-zero) elements which depend on its other non-zero
elements according to the equation system (10), and contains zero-elements
$r_n(0)$, then any row-replacement


$r_i \to r_i + \Sigma_j + r_n(0)$


is possible, where $r_i$ is any row of $\Gamma$, $\Sigma_j$ is any disjoint zero-sum of (10), and
$r_n(0)$ is any zero-element.


iii) If $H_2$ is separable:


α) If disconnected subgraphs exist, say $H_k$ and $H_l$, incident with the
vertices $r_k$ and $r_l$ respectively, these subgraphs can be joined at a single vertex
(cut-vertex) $r_{kl}$ (formed when $r_k$ and $r_l$ coalesce). The corresponding
row-replacement,

$r_k \to r_k + r_l = r_{kl}$,

$r_l \to r(0) = 00\ldots00$,

is possible for any pairs of disconnected subgraphs and for any of their vertices.


β) If a cut-vertex $r_{ij}$ and a zero-row $r(0)$ exist, the cut-vertex can
be split into $r_i$ and $r_j$ belonging to the generated disconnected subgraphs $H_i$
and $H_j$. The corresponding row-replacement


$r(0) \to r(0) + \sum_{H_i\ \mathrm{exc}\ r_{ij}} (r_\nu) = r_i$,

$r_{ij} \to r_{ij} + r_i = r_j$


is possible for any zero-row $r(0)$ and any cut-vertex $r_{ij}$.
γ) If a cut-vertex $r_{ij}$ exists, we can change it to $r_{kl}$ (operations β and α)
also without having a zero-row at our disposal, by the row-replacement


$r_k \to r_k + r_l = r_{kl}$,

$r_{ij} \to r_{ij} + \sum_{H_k\ \mathrm{exc}\ r_{ij}} (r_\nu) = r_j$,

$r_l \to r_l + r_{kl} + \sum_{H_k\ \mathrm{exc}\ r_k, r_{ij}} (r_\nu) = r_i$.


$H_k$ is the separable subgraph of $H_2$ which contains $r_k$ but not $r_l$. When $r_k$
and $r_l$ coalesce, $r_{kl}$ is formed. At the same time $r_{ij}$ is split into $r_i$ and $r_j$,
where $r_i$ but not $r_j$ is contained in $H_k$ after the change of cut-vertex.
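
Operation i) of the lemma can be tried out on a small incidence matrix. The sketch below is only an illustration under stated assumptions: it works on the full incidence matrix (all $r$ rows) rather than on the $\Gamma$-matrix of $r-1$ rows, and the graph, the vertex numbering and the name `turn_around` are hypothetical.

```python
def turn_around(incidence, ri, rj, internal):
    """Operation i): add (mod 2) the rows of the internal vertices of a loosely
    tied subgraph to the rows of its two connection vertices ri and rj."""
    n_cols = len(incidence[0])
    internal_sum = [sum(incidence[v][j] for v in internal) % 2 for j in range(n_cols)]
    result = [row[:] for row in incidence]
    for v in (ri, rj):
        result[v] = [(x + y) % 2 for x, y in zip(incidence[v], internal_sum)]
    return result


# Branches a..f; the subgraph with branches a, b, c, d has internal vertices 2 and 3
# and is tied to the rest (branches e, f through vertex 4) only at vertices 0 and 1.
#          a  b  c  d  e  f
matrix = [
    [1, 0, 0, 1, 1, 0],  # vertex 0
    [0, 0, 1, 0, 0, 1],  # vertex 1
    [1, 1, 0, 0, 0, 0],  # vertex 2
    [0, 1, 1, 1, 0, 0],  # vertex 3
    [0, 0, 0, 0, 1, 1],  # vertex 4
]
turned = turn_around(matrix, 0, 1, internal=[2, 3])
print(turned[0])  # [0, 0, 1, 0, 1, 0]
print(turned[1])  # [1, 0, 0, 1, 0, 1]
print([sum(row[j] for row in turned) for j in range(6)])  # [2, 2, 2, 2, 2, 2]
```

The branches of the subgraph that met vertex 0 now meet vertex 1 and vice versa, while every column keeps exactly two 1's, so the result is again an incidence matrix.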


Proof. The proof will run as follows. An $H_2$-matrix determines a graph
uniquely and conversely, provided the matrix has only non-zero elements.
$H_2$ specifies uniquely a complete set of loops ($G_l$), and all other matrices with
the same $G_l$ (or with loops in 1:1 correspondence with the loops of $H_2$) must
be obtained by row-replacements on $H_2$ (a row-replacement on $H_2$ does not
change $G_l$). All matrices with loops in 1:1 correspondence with the loops
of $H_2$ are obtained by connectivity changes of $H_2$-subgraphs corresponding to
(i and iii). This follows from a theorem of WHITNEY [9]:


«If there is a 1:1 correspondence between the branches of the two graphs
G and G' so that loops correspond to loops, then the graphs are strictly
2-isomorphic.»

Two graphs G and G' are 2-isomorphic if one can be transformed into the
other by the connectivity changes:


i) If $G = H_k + H_l$, where $H_k$ and $H_l$ have just the vertices $r_i$ and $r_j$ in
common and these vertices are connected in both $H_k$ and $H_l$, then $H_k$ is turned
around at these vertices.


iii) Break a graph at a single vertex into two connected pieces, or join
two connected pieces (not connected to each other) at a single vertex.


There is only one proper independent row-replacement for each change of
connectivity in the $H_2$-graph if $\Gamma$ is a generator set. These are easily found to
be the ones listed above (i and iii). However, if $\Gamma$ has depending elements,
the row-replacements of ii are possible. For let us represent a row-replacement
by a square transformation matrix $T$ of $r-1$ rows. Suppose a certain change
of connectivity transforms the matrix $\Gamma$ to $\Gamma'$:


(12) $\Gamma' = T\,\Gamma$.

Suppose there exists another matrix $T'$ for this transformation. Then:


(13) $(T + T')\,\Gamma = 0$ (mod 2).
If $\Gamma$ is a generator set the only solution is $T' = T$. If however $\Gamma$ has $k$
dependent rows according to (10) and also contains $n$ zero-rows $r_n(0)$, we first
transform $\Gamma$ so as to contain $k+n$ zero-rows and a proper generator set $\Gamma''$
(of $r-1-k-n$ rows). For this transformed $\Gamma$ the solution of (13) is:


(14) $T + T' = \begin{pmatrix} x & \cdots & x & 0 & \cdots & 0 \\ \vdots & & \vdots & \vdots & & \vdots \\ x & \cdots & x & 0 & \cdots & 0 \end{pmatrix}$,


where the indicated matrix has $k+n$ arbitrary columns (entries $x$) and $(r-1-k-n)$
zero-columns. This means that, for any allowable row-replacement $T$, we can
also replace any row by the sum of itself and any zero-rows $r_n(0)$ or any
zero-sums $\Sigma_j$, according to ii.
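
The non-uniqueness expressed by (13) and (14) can be seen in a small numerical sketch. The matrices below are hypothetical; note that here the zero-row of $\Gamma$ is placed last, so the arbitrary column of $T + T'$ is the last one.

```python
def matmul_gf2(a, b):
    """Matrix product modulo 2 for matrices given as lists of 0/1 rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) % 2 for j in range(len(b[0]))]
        for i in range(len(a))
    ]


gamma = [
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 0],  # a zero-row r(0)
]
t = [           # row 0 is replaced by row 0 + row 1
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
]
t_alt = [       # differs from t only in the column that multiplies the zero-row
    [1, 1, 1],
    [0, 1, 1],
    [0, 0, 1],
]
t_other = [     # differs from t in a column that multiplies an independent row
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
]
print(matmul_gf2(t, gamma) == matmul_gf2(t_alt, gamma))    # True
print(matmul_gf2(t, gamma) == matmul_gf2(t_other, gamma))  # False
```

Only the columns of the transformation matrix that multiply zero-rows are free, which is the content of (14).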

We are now able to prove the following constructive $\Gamma_{r2}$-criterion.


The subrearrangement $\Gamma_{r2}$-criterion. A $\Gamma_r$-matrix is converted by
row-replacement operations so as to contain at most two 1's and at least one 1 in
$\nu'$ of its $c$ columns and more than two 1's in the remaining $\nu''$ columns. A
corresponding $\Gamma_l$-matrix is transformed with row-replacement operations so that
$(\Gamma_l)_{\nu''}$ (the $\nu''$-columns of $\Gamma_l$) is converted into independent rows and zero-rows.
The rows of the so transformed $\Gamma_l$-matrix which contain these independent
rows of $(\Gamma_l)_{\nu''}$ are deleted, and the remaining rows are denoted $\Gamma_l'$. All
non-empty columns, say $\nu'$, of $\Gamma_l'$ are denoted $(\Gamma_l')_{\nu'}$. The same columns of
$\Gamma_r$ are denoted $(\Gamma_r)_{\nu'}$. The only row-replacements which may convert $\Gamma_r$
into $\Gamma_{r2}$-form are those which correspond to rearrangements of subgraphs
according to lemma 2 with $\Gamma = (\Gamma_r)_{\nu'}$. If the $r$-th row of $(H_2)_{\nu'}$, which is not
present in $(\Gamma_r)_{\nu'}$, is involved in a row-replacement, a row $r_n$ of $(\Gamma_r)_{\nu'}$ which
is not involved is replaced by the $r$-th. (The $\nu'$-$\nu''$-division is unaffected by
this replacement.) Should there be an empty column of $\Gamma_r$, no $\Gamma_{r2}$-form of
$\Gamma_r$ exists.


Proof. Let us first consider the case when there is an empty column of
$\Gamma_r$. Then any other form of $\Gamma_r$ must also have only zeroes in this column,
and we know from lemma 1 that no $H_2$-form exists. Let us then consider the
division of the columns into a $\nu'$ and a $\nu''$ ($= c - \nu'$) part. We will prove that
$(\Gamma_r)_{\nu'}$ as defined in the criterion is a $\Gamma$-set (compare lemma 2) of an $H_2$-matrix
corresponding to a loop-graph with all its loops in 1:1 correspondence with
all the loops of $H_2$ which are only incident with the $\nu'$-branches. Since $(\Gamma_r)_{\nu'}$
has exactly as many rows as $\Gamma_r$, and since a row-replacement on $(\Gamma_r)_{\nu'}$
operates in the same way on $\Gamma_r$, it follows that the row-replacements according
to the criterion are the only ones which can possibly convert $\Gamma_r$ into $\Gamma_{r2}$-form.
Other row-replacements on $(\Gamma_r)_{\nu'}$ which conserve the $(\Gamma_{r2})_{\nu'}$-form can be
expressed as further improper replacements, and an improper replacement cannot
diminish the number of 1's in a $\nu''$-column to two or one. Let us now denote
with $G_l'$ the group (with the zero-row deleted) generated by $\Gamma_l'$ and the rest
of the group $G_l$ with $G_l''$ (with the zero-row deleted). The $\nu'$-columns of $G_l'$
and $G_l''$ are denoted with $(G_l')_{\nu'}$ and $(G_l'')_{\nu'}$. The remaining columns of $G_l'$ and
$G_l''$ are denoted with $(G_l')_{\nu''}$ and $(G_l'')_{\nu''}$ respectively. Equation (3) is with this
division equivalent to


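(15) $\Gamma_r\, G_l'^{\,T} = 0$,
(16) $\Gamma_r\, G_l''^{\,T} = 0$.


Since $(G_l')_{\nu''}$ has only zero-elements, (15) implies


(17) $(\Gamma_r)_{\nu'}\, (G_l')_{\nu'}^{\,T} = 0$


and from (16) we obtain


(18) $(\Gamma_r)_{\nu''}\, (G_l'')_{\nu''}^{\,T} = (\Gamma_r)_{\nu'}\, (G_l'')_{\nu'}^{\,T}$ (mod 2).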


Hence when we want to determine $\Gamma_r$ from (3) this is equivalent to a
determination of the $(\Gamma_r)_{\nu'}$-part from (17) only, and then the remaining
$(\Gamma_r)_{\nu''}$-part is determined from (18). Hence we have referred the determination of
$(\Gamma_r)_{\nu'}$ to a problem of the same character as the problem of determining $H_2$
from (3) with a specified group $G_l$. We know that $(G_l')_{\nu'}$ is a proper
loop-group, because it consists of all those loops of $G_l$ which are only incident with
the $\nu'$-branches, and it contains no zero-column. Hence the graph
corresponding to $(\Gamma_r)_{\nu'}$ (obtained by adjoining an $r$-th row) must be a loop-graph,
not necessarily connected however since $(\Gamma_r)_{\nu'}$ may contain dependent rows
and zero-rows. $(\Gamma_r)_{\nu'}$ is, by definition, of $(\Gamma_{r2})_{\nu'}$-form and it follows from
lemma 2 (with $\Gamma = (\Gamma_r)_{\nu'}$) that the sub-rearrangement criterion is correct.

When trying to transform a $\Gamma_r$-matrix into $\Gamma_{r2}$-form, this criterion can
always be applied and is an effective short cut. However, some trial and error
is involved when it is used to investigate the existence of a $\Gamma_{r2}$-form. We
will therefore only use it in deriving the subrearrangement $\Gamma_l$-criterion. This
criterion is the desired solution to our problem. It is very easy to apply
because it is systematic and only contains a few ($\mu$) applications of a single
operation. ($\mu = c - r + 1$, the number of loops of $\Gamma_l$, is also called the cyclomatic
number.)
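
For a connected graph the cyclomatic number can be cross-checked against the rank of the incidence matrix modulo 2. The sketch below assumes rows stored as integer bit masks; the helper `gf2_rank` and the sample graph are hypothetical.

```python
def gf2_rank(rows):
    """Rank over GF(2) of a list of integer bit-mask rows (simple elimination)."""
    pending = list(rows)
    rank = 0
    while pending:
        pivot = pending.pop()
        if pivot:
            rank += 1
            lowest = pivot & -pivot  # lowest set bit of the pivot row
            pending = [x ^ pivot if x & lowest else x for x in pending]
    return rank


# Connected graph with r = 5 vertices and c = 6 branches (columns a..f, leftmost bit = a).
incidence = [0b100110, 0b001001, 0b110000, 0b011100, 0b000011]
r, c = len(incidence), 6
mu = c - r + 1
print(mu)                       # 2
print(c - gf2_rank(incidence))  # 2, the same number obtained from the rank
```

The agreement reflects that the incidence matrix of a connected graph has rank $r-1$, so that $c - (r-1)$ independent loops remain.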

Let us state the criterion: 
The subrearrangement $\Gamma_l$-criterion. A necessary and sufficient condition
that a loop generator set $\Gamma_l$ (in an explicitly independent form) has a
pair $\Gamma_r$ of $\Gamma_{r2}$-form, is that a connected subgraph which corresponds to an
arbitrary selection of loops (rows) of $\Gamma_l$ can be rearranged with operations i) and
iii-γ) (see Lemma 2), so that the branches of the subgraph which are contained in
another loop of $\Gamma_l$ form a single path. The connected subgraph which also
contains this new loop is obtained by connecting to the two end-vertices of
the path the other branches of the new loop, connected in series. If no branch
of the loop is contained in the previous subgraph, the new loop is connected
to the subgraph at one arbitrary vertex. If the process is started with a single
loop of $\Gamma_l$ as subgraph, it ends up with a desired $H_2$-graph, or if a
path-rearrangement with operations i) or iii-γ) is impossible, then no $H_2$-graph exists.
The order in which the loops are taken is arbitrary. Also if more than one
rearrangement is possible, it is without significance which one is chosen.

Let us for the proof consider a $\Gamma_l$, $\Gamma_r$-pair


8; 
SA 
— 
I, HN Esa e ORARIA TI 
0 1 0 0] | y 
REA (rd pe O0SO0SUEUSO (Tra 
3 UL rt 
(19) Te = ji — —_'! 0 0 0 0 0 ] 
lea 2e] 02 0€0 2100 0 EE MAUR 
SG ES à ber a ES 
020007 00M hee i 
Vi vi y! 
Vi vi 
Vian Vizi 
F rar 1 0. DO 0-6 O00 10 
| 0 0 0 0 0 0 0 0 0 
| || 0 170: 020 01000 
0.05302 0.02000: 6 
0.00 040 STEREO 
(20) ga È = 
r— 5 001 000001000 0||r,;(0), 
r — 4 001 0-0 00-00 10°06 Ti,2(0), 
r—3||00 01 00000001 0 offr,,(0), 
r—2 000 0 0 0 0 0 0 0 0 1 Offr.4(0), r3,1(0) 
r—1][0 0 0 OLILJ/0 0 0 o 0 0 0 © 0 1/1r,,(0), 44 (0) 
$\Gamma_l$ is transformed into an explicitly independent form, i.e. with $\mu$ different
columns only containing one 1. The corresponding (compare (3)) pair $\Gamma_r$ (20)
is then immediately obtained in explicitly independent form. The rows of
the right part of $\Gamma_r$ are the columns of the left part of $\Gamma_l$.
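
The relation between the two explicitly independent forms can be sketched in code. The column layout used here is an assumption made for the example: $\Gamma_l = [A \mid I]$ with the unit part to the right, and $\Gamma_r = [I \mid A^T]$; the function names are hypothetical.

```python
def explicit_pair(a):
    """From the non-unit part A (mu rows) build Gamma_l = [A | I] and Gamma_r = [I | A^T]."""
    mu, n = len(a), len(a[0])
    gamma_l = [a[i] + [int(k == i) for k in range(mu)] for i in range(mu)]
    gamma_r = [
        [int(k == j) for k in range(n)] + [a[i][j] for i in range(mu)]
        for j in range(n)
    ]
    return gamma_l, gamma_r


def orthogonal_gf2(rows_a, rows_b):
    """True if every row of rows_a has an even number of common 1's with every row of rows_b."""
    return all(
        sum(x * y for x, y in zip(ra, rb)) % 2 == 0 for ra in rows_a for rb in rows_b
    )


a = [
    [1, 1, 0],
    [0, 1, 1],
]
gamma_l, gamma_r = explicit_pair(a)
print(gamma_l)  # [[1, 1, 0, 1, 0], [0, 1, 1, 0, 1]]
print(gamma_r)  # [[1, 0, 0, 1, 0], [0, 1, 0, 1, 1], [0, 0, 1, 0, 1]]
print(orthogonal_gf2(gamma_r, gamma_l))  # True
```

Each row of one matrix shares an even number of 1's with each row of the other, which is the orthogonality that allows the pair to be written down immediately.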

We will temporarily assume that it is possible to construct a loop-graph
corresponding to some of the rows of $\Gamma_l$, say the $i$ first (in (19) the three
first). Let the $\mu - i$ columns of the independent part of $\Gamma_l$ which have 1's
in the remaining $\mu - i$ loops of $\Gamma_l$ be the $\nu''$-partition of the $\Gamma_{r2}$-criterion.
We denote it with $\nu_i''$, and we use the index $i$ also for the $\nu'$-partition, $\nu_i'$. There
may be a number of columns of $(\Gamma_l)_{\nu_i'}$ which are empty, say the $s_i$ last. Withdrawing
them from $\nu_i'$, we have the $\nu_i'$-partition as indicated in (19). The $(\Gamma_r)_{\nu_i'}$-matrix
consists of $s_i$ zero-rows $r_{i,1}(0)$, $r_{i,2}(0)$, ..., and the other rows
are explicitly independent. According to our assumption, $(\Gamma_r)_{\nu_i'}$ can be
transformed into $\Gamma_{r2}$-form. There are two possibilities for the corresponding subgraph.
Either the graph $G_i$ of the independent rows is nonseparable (Fig. 1-a) or it
is separable (Fig. 1-b). If for instance in the last case there are $s_i$ cut-vertices
and $s_i$ zero-rows, the graph can be split (iii-β of Lemma 2) into $s_i + 1$
disconnected pieces (Fig. 1-c). In order to diminish the number of types of
rearrangements, we can always require that $G_i$ shall consist of only one piece.
Even if $G_i$ is separable and consists of disconnected pieces we know that the
final $H_2$-graph shall be connected, but we do not know a priori how the pieces
shall be connected. But if we connect them arbitrarily (iii-α) we can always
change their connectivity in any desirable way only with the operation iii-γ)
of Lemma 2.

Fig. 1. — Possibilities for an $H_2$-graph.

Let us next consider the $(\Gamma_r)_{\nu_{i+1}'}$-matrix, which corresponds to the same
single loops as before and one further, say the $(i+1)$-th loop $l_{i+1}$ of $\Gamma_l$. The
$\nu_{i+1}'$-partition is indicated in (19) and (20). There are two kinds of possibilities.

1) The $s_i$-columns to the right in $\Gamma_l$ are still empty in the first $i+1$
rows. Then $(\Gamma_r)_{\nu_{i+1}'}$ has again $s_i$ zero-rows, and the other rows are independent.

2) One or more ($s_i - s_{i+1}$) of the $s_i$-columns are non-empty in the first
$i+1$ rows of $\Gamma_l$. The $(\Gamma_r)_{\nu_{i+1}'}$-matrix has $s_{i+1}$ zero-rows and the other rows
are independent. (In (19) $s_i - s_{i+1} = 3$.)

Let us first consider case 1). We know that $(\Gamma_r)_{\nu_i'}$ has a $(\Gamma_{r2})_{\nu_i'}$-form.
$(\Gamma_r)_{\nu_{i+1}'}$ only differs from $(\Gamma_r)_{\nu_i'}$ in that it contains one further column, the
$(i+1)$-th. The content of this column is obtained by the product


(21) $(\Gamma_r)_{\nu_i'}\, (l_{i+1})_{\nu_i'}^{\,T}$

for we must have (compare (17))

(22) $(\Gamma_r)_{\nu_{i+1}'}\, (l_{i+1})_{\nu_{i+1}'}^{\,T} = 0$.


(21) means that the $(i+1)$-th column of $(\Gamma_r)_{\nu_{i+1}'}$ has a 1 in each
row that corresponds to an end-vertex of a path-piece formed by the
$l_{i+1}$-branches that are incident with the $G_i$-graph. It now follows from the
$\Gamma_{r2}$-criterion that a necessary and sufficient condition for $(\Gamma_r)_{\nu_{i+1}'}$ to be of
$\Gamma_{r2}$-form is that it is possible to rearrange the connected $G_i$-graph with
operations i) and iii-γ) so that the branches of it which are incident with the loop
$l_{i+1}$ form a single path (with two end-vertices). The $G_{i+1}$-graph is simply
obtained by connecting the $(i+1)$-th branch between the two end-vertices of
the path.
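
The statement that the new column marks the end-vertices of the path-pieces can be illustrated as follows. This is a minimal sketch, assuming an incidence matrix given as lists of 0/1 rows; the square graph and the name `end_vertex_column` are hypothetical.

```python
def end_vertex_column(incidence, loop_branches):
    """Mod-2 sum of the incidence columns of the given branches: the result has a 1
    exactly at the vertices where an odd number of these branches meet."""
    column = [0] * len(incidence)
    for j in loop_branches:
        for v, row in enumerate(incidence):
            column[v] ^= row[j]
    return column


# Square graph: branches a=(0,1), b=(1,2), c=(2,3), d=(3,0).
#          a  b  c  d
square = [
    [1, 0, 0, 1],  # vertex 0
    [1, 1, 0, 0],  # vertex 1
    [0, 1, 1, 0],  # vertex 2
    [0, 0, 1, 1],  # vertex 3
]
print(end_vertex_column(square, [0, 1]))  # [1, 0, 1, 0]: one path with ends at 0 and 2
print(end_vertex_column(square, [0, 2]))  # [1, 1, 1, 1]: two path-pieces, four ends
```

A column with exactly two 1's is what an incidence matrix requires of the new branch, which is why the branches in question have to form a single path.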

Let us then consider case 2) where $s_{i+1} \neq s_i$, i.e. $(\Gamma_r)_{\nu_{i+1}'}$ contains ($s_i - s_{i+1}$)
non-zero rows more than $(\Gamma_r)_{\nu_i'}$ and the $\nu_{i+1}'$-partition contains ($s_i - s_{i+1} + 1$)
branches more than the $\nu_i'$-partition. (21) gives the content of the $(i+1)$-th
column of $(\Gamma_r)_{\nu_{i+1}'}$, here however only in the positions corresponding to the
non-zero rows of $(\Gamma_r)_{\nu_i'}$. Thus we see that if the path-pieces of the $l_{i+1}$-branches
which are incident with the $G_i$-graph can be rearranged into a single path in
$G_i$ with operations i) and iii-γ) then we can transform $(\Gamma_r)_{\nu_{i+1}'}$ into $\Gamma_{r2}$-form in
the following way. Row $r-4$ (see (20)) is added to row $r-3$. Next row
$r-5$ is added to row $r-4$. Finally row $r-5$ is added to a row
corresponding to an end-point of the path in the $G_i$-graph. The corresponding
geometrical construction is that we connect to the two path end-points in $G_i$ a
path consisting of a series-connection of the remaining new loop-branches
of $l_{i+1}$. The sufficient path-condition on $G_i$ is also necessary for we know that
all $(\Gamma_{r2})_{\nu_i'}$-matrices are generated with operations i) and iii-γ) of Lemma 2 from
the one obtained. If the $G_i$-graph is separable it might be possible to rearrange
the branches of the $l_{i+1}$-loop so that those branches which are incident with
the $G_i$-graph form more than one path. But then the $G_i$-graph is not
connected and it shall be according to the criterion when the $G_{i+1}$-graph is formed.
After that, of course, any i) or iii-γ) operation on $G_{i+1}$ is allowable. Finally if
the intended path-branches of $G_i$ cannot be rearranged into a single path,
then no $G_{i+1}$-graph exists, nor an $H_2$-graph. For we know from the $\Gamma_{r2}$-criterion
that we can determine $\Gamma_{r2}$ by determining any $(\Gamma_{r2})_{\nu'}$-partitions
successively (compare (17)), and we have shown that the above $\nu_i'$-partitions of
an explicitly independent $\Gamma_l$-set are $\nu'$-partitions. So if no single path of the
$l_{i+1}$-branches can be made in $G_i$, the only construction would be to close the
$l_{i+1}$-branches between the path-pieces in $G_i$ (connected) so that the $l_{i+1}$-loop
is formed. But then the number of new non-zero rows (real vertices) in (20)
would be less than $s_i - s_{i+1}$, and hence this construction is impossible. This
proves the $\Gamma_l$-criterion.
In the criterion the word «corresponding» is used in the same sense as
was explained at the end of Sect. 3.

The subrearrangement criterion is really a solution to the Okada-Seshu 
problem (*) (compare Sect. 1), for as we shall see in the next section, there 
is no problem in rearranging intended loop-branches into a single path or in 
deciding whether a rearrangement is impossible. 


5. — Path-rearrangements. 


We will use the following expressions. 


Separable subgraph: a subgraph which is connected to an internally con- 
nected subgraph at only one vertex, a cut-vertex [8]. 


Loosely tied, separable subgraph: a subgraph which is connected to an inter- 
nally connected subgraph at precisely two vertices, and which contains at least 
one « internal cut-vertex » (a cut-vertex if the subgraph is regarded alone). 


Loosely tied, nonseparable subgraph: the same as in the previous case but 
with no «internal cut-vertex ». 


We want to decide if a number of intended loop-branches, say the
path-branches $p_i$, can be rearranged with operations i) and iii-γ) of Lemma 2 so that
a single path is obtained. The simple method to be described is quite
straightforward. The procedure can be divided into a couple of independent steps.

Let us first consider the case when the graph contains a number
of separable subgraphs containing $p_i$-branches. Since such subgraphs
can only be connected at cut-vertices (i.e. so that no new loop is formed)
it follows that the $p_i$-branches in a separable subgraph must be rearranged into
a single path, however with no conditions on the end-points, since a separable
subgraph can be reconnected (iii-γ) at any vertex. The subgraphs are then
reconnected so that a path of the $p_i$-branches is formed (see Fig. 2 a and b).
The path is connected to an end-point of a possible $p_i$-path in the main
subgraph. For simplicity one could omit separable subgraphs which do not contain
$p_i$-branches. We have now referred the investigation to a non-separable graph.

Fig. 2. — Rearrangement of separable subgraphs for a path-formation.

(*) Recently R. GOULD [1] has given a systematic treatment of the problem,
however far more complicated than the solution given here.

We consider a loosely tied, non-separable subgraph $g_i$, only connected to
the remaining graph at the two change-around vertices $v_{i1}$ and $v_{i2}$. Even if
$g_i$ and the remaining graph contain other change-around vertices, no branch
which from the beginning is not contained in $g_i$ can be rearranged into $g_i$.
So if $g_i$ does not contain a $p_i$-branch, it could for simplicity be replaced by a
single branch. If it does contain $p_i$-branches, these can only be brought into
contact with $p_i$-branches outside $g_i$ at the two vertices $v_{i1}$ and $v_{i2}$. Hence the
$p_i$-branches in $g_i$ must be rearranged so that they form paths in $g_i$ with
end-points at $v_{i1}$ or $v_{i2}$. We will here distinguish between two cases:


1) All $p_i$-branches outside $g_i$ form a single path with its two end-points
at $v_{i1}$ and $v_{i2}$ respectively.


2) All other cases concerning the $p_i$-branches outside $g_i$.


The distinction between the two cases is immediately recognized. If a
$p_i$-path exists outside $g_i$ so that it is incident with both $v_{i1}$ and $v_{i2}$, then the
path-piece between $v_{i1}$ and $v_{i2}$ always exists (it cannot be broken by a
rearrangement). So if there are no more $p_i$-branches outside $g_i$ than the
mentioned path-piece we have case 1), otherwise case 2), for in the second case
not both $v_{i1}$ and $v_{i2}$ can be end-vertices. Also, of course, in case 2) we do not
need to have a path between $v_{i1}$ and $v_{i2}$ outside $g_i$.

In case 1) there are two possibilities for the $p_i$-branches inside $g_i$. They
must either be rearranged into a single path with one end-point at $v_{i1}$ or
at $v_{i2}$. Or they must be rearranged into two single paths, one with an
end-point at $v_{i1}$ and the other with an end-point at $v_{i2}$.

In case 2) there is only one possibility: the $p_i$-branches inside $g_i$ must be
rearranged into a single path with one end-point at $v_{i1}$ or at $v_{i2}$ (or with both
end-points at $v_{i1}$ and $v_{i2}$ respectively).

If these possibilities for the two cases are not at hand then a resulting
$p_i$-path cannot be formed.

If however a loosely tied, non-separable subgraph does not exist, the whole
graph must consist of a single loop (after the indicated simplifications). Any
pair of non-adjacent vertices is then a change-around pair but the
corresponding loosely tied subgraphs are separable for they contain internal
cut-vertices. In this case it is clearly always possible to rearrange the $p_i$-branches
so that they form a single path. For simplicity we could reduce the number
of non-$p_i$-branches in a pure series-connection of branches to one.

The whole procedure is now simply this. Start with a loosely tied,
non-separable subgraph $g_i$ containing at least one $p_i$-branch and not containing
a smaller $p_i$-containing loosely tied, non-separable subgraph. If case 1), make
the above-mentioned rearrangements. If they cannot be done with a pure
series-interchange of the branch-order no resulting single path exists. If case 2),
make the above-mentioned corresponding rearrangement and simplifications.
If the rearrangement cannot be done with a pure series-interchange of the
branch-order, no resulting single path exists. If the rearrangement is possible
the loosely tied, non-separable subgraph should be enlarged to another
subgraph of the same kind and the procedure is repeated. An enlargement is
only possible if there exists a change-around vertex pair $v_{i+1,1}$, $v_{i+1,2}$ so that
one of these vertices is incident with one vertex of the previous
pair $v_{i1}$, $v_{i2}$ and so that the other $v_{i+1}$-vertex is not contained in the
$g_i$-graph. The process ends up with a desired resulting $p_i$-path or with
a decision that no such path exists. It is of course also possible to start
from different subgraphs, each being successively enlarged.

Fig. 3. — Examples of $p_i$-configurations for path-tests.

Corollaries from this general treatment are the following tests.

If none of the vertices of a not properly placed $p_i$-branch (end-points
of a path-piece) is incident with a vertex of a change-around
pair, then no path can exist (see Fig. 3-a).

If only one of the vertices of a not properly placed $p_i$-branch (end-points of a
path-piece) is incident with a vertex of a change-around pair (see Fig. 3-b
and 3-c), and if more than two such branches (path-pieces) exist, then no
resulting path can be formed.

If more than two $p_i$-branches are completely tied at a single vertex, then
no path can exist (Fig. 3-d).
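
The goal of the rearrangements, a single path of the $p_i$-branches, is simple to test for a given configuration. The sketch below only checks the configuration as it stands and does not attempt any rearrangement; the representation of branches as vertex pairs and the function name are assumptions of the example.

```python
from collections import defaultdict


def forms_single_path(branches):
    """True if the branches (pairs of vertices) form one open path: no vertex meets
    more than two of them, exactly two vertices are end-points, and all hang together."""
    if not branches:
        return False
    neighbours = defaultdict(list)
    for u, v in branches:
        neighbours[u].append(v)
        neighbours[v].append(u)
    degrees = [len(adjacent) for adjacent in neighbours.values()]
    if any(d > 2 for d in degrees) or degrees.count(1) != 2:
        return False
    start = next(iter(neighbours))
    seen, stack = {start}, [start]
    while stack:  # depth-first search for connectivity
        for w in neighbours[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(neighbours)


print(forms_single_path([(0, 1), (1, 2)]))          # True
print(forms_single_path([(0, 1), (2, 3)]))          # False: two separate path-pieces
print(forms_single_path([(0, 1), (0, 2), (0, 3)]))  # False: three branches at one vertex
```

The last call corresponds to the configuration of the third test: when more than two $p_i$-branches meet at one vertex and cannot be moved apart, no path can be formed.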


6. — Examples. 


In the first example given here we will apply the c-criterion and the
$\Gamma_l$-criterion to a Boolean function in order to investigate if it has an irredundant
network. In the second example we want to decide if a $\Gamma_r$-set can be
transformed into $\Gamma_{r2}$-form. For this we apply the $\Gamma_l$-criterion to a $\Gamma_l$-pair of $\Gamma_r$.
The third example, finally, will illustrate that it is in general necessary that
$\Gamma_l$ is first transformed into an explicitly independent form before the
$\Gamma_l$-criterion can be applied.


Example 1. - We want to investigate if 
B= abudfgu ac'deg us cefgube ef 


has an irredundant network. B contains no redundant literals, for it contains 
only prime implicants (compare [3]). The corresponding L,-matrix is 


Ecate 
dg OO 00 
10550, 070% LO ae 

L= E SOU EL ei a 
LO "OF TO ae tee LAI 
Le Die LO TL Dania 

$L_T$ is transformed with row-replacements into an explicitly independent
$\Gamma_l$-form:

Pa bi 08 ait debt Sa 
1,00 0 0. TOM gee 
0° Dorie Re ROUES 

Lies 
OO. LO le Lie ES 
O90) OA LATE hte 

$\Gamma_l$ generates the coset $c$:
POT DVO IA Le Mg 
LO 200 O° he OT 
Lo COS del De IAN 
LT 0 FINE ia varo 
LO) UA Tot. 0 don er 
(6) — 

Lt) Quy 0. Pr dai 
1° 100 SNS ORIO 
dk is | Peek | METRE 
L 1409 bh Ome enn 
and hence we obtain for B(c):
B(c) = Buabed bee dfuace’g . 


The c-criterion is fulfilled, for the two last intersections vanish, and the
other indicated clause of B(c) subsumes the clause ab of B.

Let us now apply the $\Gamma_l$-criterion, for instance in the loop-order 4, 3, 2, 1.
The four steps are indicated in Fig. 4. First loop 4 of $\Gamma_l$ is drawn (Fig. 4-a).
The $p_i$-branches for loop 3, e and d, are already in a single path, and the
corresponding graph is drawn (Fig. 4-b). In order to form a path of the
$p_i$-branches c' and e for loop 2 the series-connected branches g and c' are
interchanged. The graph also containing loop 2 is drawn (Fig. 4-c) and the
$p_i$-branches for the final loop 1 are indicated. We have here three $p_i$-branches
and two of them, for instance d and g, are contained in a loosely tied,
non-separable subgraph (indicated in Fig. 4-c). We have here case 2) because no
path of $p_i$-branches outside the subgraph between its connection vertices exists.
Thus the two branches d and g must be rearranged into a path in the subgraph
with an end-point at one of its connection vertices. So the series-connected
branches b and g are interchanged. After that the subgraph is turned around
and the desired path is formed. The remaining loop can now be adjoined and
the irredundant graph of B is shown in Fig. 4-d.


Fig. 4. — $G_i$-constructions for example 1.


Example 2. Let us investigate if the following $\Gamma_r$-set


db 6110 GOs te ae oh 

Opel le Or Ore 

TORO acl 0 ISSU TU 
1 

LRO OU SOLE RO 

INR ORSO ORO Qa 
Fig. 5. — $G_i$-graph for example 2.
1 060 0 TERI
0-9 ls 0500 RO 
T, = | 
OO Oe ee | 3 
0. 0. Ost slate Deo 


can be transformed into $\Gamma_{r2}$-form, i.e. let us investigate
if the indicated $\Gamma_l$-pair passes the $\Gamma_l$-criterion or not.
The first steps corresponding to the three first loops of
$\Gamma_l$ give the $G_i$-graph of Fig. 5 where the $p_i$-branches
for loop 4 are indicated.

The branch e is not incident with a change-around
vertex and since the two $p_i$-branches e and f do not
already form a path, we conclude that no $\Gamma_{r2}$-pair to
$\Gamma_l$ exists.


Example 3. — Let us consider the loop generator-set $\Gamma_l$:


abaco fg 

LS 11606001. 150001 

CLOS II MERS 
dr 

Le 0e OL a ON PME 

OL OPON TEINTE ONE 


A corresponding explicitly independent set PRAT 


According to the $\Gamma_l$-criterion we must apply it to an explicitly
independent form. We want to see however what can happen when the criterion is
applied on the above $\Gamma_l$-form. After the three first steps (the three first loops of
$\Gamma_l$) we obtain the $G_i$-graph of Fig. 6-a. The process cannot be carried out further
because the fourth loop does not contain any new branches. Although all
the loops of $G_i$ are correct (compare the complete graph of Fig. 6-b which is
obtained when the criterion is correctly applied to the explicitly independent
set), the structure of $G_i$ has nothing to do with the structure of the complete graph.

We want to point out also that it can happen that there is a new branch
at each step of a $\Gamma_l$-form which is not explicitly independent. In any case the
$\Gamma_l$-criterion should not be applied until $\Gamma_l$ has been transformed into explicitly
independent form, for it can otherwise happen that a row of $\Gamma_l$ represents a
multiple loop.


This work has been supported by the Swedish Research Institute of National
Defense and by the publishing-house Natur och Kultur, Stockholm.


The stimulus-reaction systems which have been studied most intensively
by physiologists are threshold systems, the so-called all-or-none systems. They
give no graded response to a stimulus input signal, and the analysis of such
systems is therefore very difficult.

The situation is obviously much simpler when the system under consideration
responds in a graded manner to stimuli. Here one can be more hopeful of
carrying the analysis of the transforming filter down to the differential
equations involved.

One such system, which responds in a graded manner to light stimuli, is the
growth organ of the sporangiophores of Phycomyces. It serves as a receptor
of stimuli, evaluates the absorbed information and finally serves as an
effector system.

The sporangiophores of Phycomyces are parts of single-cell systems. Each
of them carries a sporangium on its upper end. Growth is confined to a zone
3 mm in length immediately below the sporangium. In the growing zone the
cell wall stretches and new wall material is built into the « gaps ». The
growing zone stretches without getting longer: for every amount that it grows
in length, a corresponding amount at its bottom is converted into secondary
cell wall which has ceased to grow in length. In this stage of growth, after
formation of the sporangium, a steady growth rate is maintained for many hours
(ERRERA, 1884 and CASTLE, 1942). The rate amounts to about 3 mm per hour.
The stretch is not distributed homogeneously along the growing zone. The
strength of stretch is maximum 0.5 mm below the sporangium, stays on a
nearly constant plateau between 0.7 and 1.9 mm and finally drops down
(COHEN and DELBRUECK, 1958).


The sporangiophores are positively phototropic. When they are exposed
to light from one side they grow towards the light. BLAAUW (1914) discovered
another effect of light on the growth of the sporangiophores: the light growth
response. This response refers to the situation in which the specimen is at
all times symmetrically illuminated from two or more sides. If the illumination
is symmetric with respect to the vertical axis and if the specimen is growing
vertically at the start of the experiment, this illumination will not cause it
to deviate from vertical growth. If the illumination is kept at a constant
intensity for a certain length of time, the rate of growth is also constant and
is the same whatever the intensity. However, a striking and transient change
in growth rate occurs when the intensity of illumination is changed.

We shall deal first of all with these so-called light growth responses.
In this case the input variable is I(t), the light intensity as a function
of time. The output variable which we measure is the growth velocity v(t) of a
sporangiophore. The connecting process which transforms I(t) into v(t) is the
unknown filter (DELBRUECK and REICHARDT [1]).

In order to analyse the filter process we submit the growing zone to
different stimuli, measure the growth reaction and draw conclusions about the
filter process from these functional data. The first experimental fact,
already mentioned, is


(1)    $v(t) = \mathrm{const}$  if  $I(t) = \mathrm{const}$.


The rate of growth is constant if the intensity of light is kept constant,
whatever the value of the intensity. In the second step we raise the question
whether the transformation between I(t) and v(t) is a linear one or not. Let
us call the first test stimulus, which refers to a particular change of
intensity, I_1(t) and the associated growth velocity deviation Dv_1(t). The
second stimulus (applied to the growing zone after the reaction to the first
one has died out) we call I_2(t) and the associated growth velocity deviation
Dv_2(t). If the transformation between I(t) and v(t) were linear, a
superposition of the two stimuli would be followed by a reaction equal to the
superposition of the single reactions to the programs I_1 and I_2. The
experiments are in flat contradiction to this; the superposition rule does not
hold. This means that the filter process between I(t) and v(t) is non-linear.

We now consider a special light program. The plant was illuminated with
constant intensity I_1 for a while (for instance 30 min). At time t = t_0 we
superimposed a light flash of duration Δt (for instance 15 s). More exactly, the
intensity jumped from I_1 to I_2, was kept constant for the time Δt and was
finally switched back to the original intensity I_1. The growth velocity reaction
to such a light program showed the following behaviour: after the stimulus,
growth continues at its normal rate for 2.5 minutes, then increases for a few
minutes to a maximum which may be twice as high as the normal rate. It then
decreases again, goes below normal, and returns to normal by about
15 minutes after the stimulus. The net gain in growth due to such a stimulus
is zero, i.e. the transient increase in growth rate is compensated for by the
subsequent fall below the normal level. The stimulus does not produce extra
growth; it simply alters the distribution in time of the growth that would have
taken place during the same period in the absence of the stimulus.

If one repeats the last experiment, keeping the adapting intensity I_1
constant but varying the intensity or duration of the stimulus, one finds
first that the reaction is a graded one. Secondly, a change in the stimulus
does not alter the latent period. Thirdly, the shape of the response
curve is independent of the stimulus except for very large stimuli. Fourthly,
the responses depend only on the product intensity × time as long as
the duration of the stimulus does not exceed about one minute.

From what we have just said about the general characteristics of the res- 
ponse as a function of stimulus, it is clear that it is not necessary to measure 
the entire response curve each time in order to get a measure of the intensity 
of the response. We have chosen as a measure of the response the ratio of the 
growth during the period from 2.5 to 5 minutes after the stimulus to that 
during the period from 0 to 2.5 minutes. The period from 2.5 to 5 minutes 
takes in most of the positive phase of the response, and the period from 0 
to 2.5 minutes gives the base line of normal growth. This ratio will be called R. 

If one considers the adapting intensity I_1 as a parameter and varies the
stimulus S = I_2·Δt, R turns out to be a logarithmic function of S/I_1,


(2)    $R \sim \log (S/I_1)$.


In other words, the stimuli have to be increased in proportion to the
adapting intensities in order to produce the same growth velocity reactions.
This relation is analogous to the well-known Weber-Fechner law. These
experimental findings give information about the sensitivity of the specimens
after they have been brought to equilibrium with an adapting intensity of
illumination.

We wish now to introduce a measure for the level of adaptation, i.e. we
want to introduce a quantity which in some manner characterizes the
sensitivity of the specimen at any given moment, whether it is in equilibrium
or not. The measure of adaptation will be most accurate if it utilizes
responses in a region where the response changes most rapidly with the
stimulus. A response R = 1.4, which is produced by S/I_1 = 2^3 minutes, is a
suitable choice for this purpose. We call R = 1.4 the standard response and
the stimulus which produces it the critical stimulus. Our measure of the level
of adaptation should obviously be proportional to the critical stimulus. One
could make the proportionality factor equal to one, but this would be somewhat
arbitrary since the critical stimulus was defined with the aid of an
arbitrarily chosen standard response. A more rational choice is achieved by
the following line of thought. We
ask: with which intensity would we have to equilibrate the specimen in order
to bring it to the same level of adaptation? This we will call the equivalent
intensity. It is obtained by dividing the critical stimulus by 8 minutes. We
will give this intensity the name A. After equilibration with I_1, the level of
adaptation is by definition A = I_1. One is now in a position to outline a
procedure for determining A also for some non-equilibrium states. The
procedure consists in bringing the specimen into the particular state, testing
it with various stimuli, determining by interpolation the stimulus giving the
response R = 1.4 and dividing this stimulus by 8 minutes.
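
The interpolation procedure just described can be sketched in Python. This is only an illustration under stated assumptions: the function name `level_of_adaptation`, the sample data, and the choice to interpolate linearly in log S (suggested by relation (2)) are hypothetical and not taken from the original experiments.

```python
import math

def level_of_adaptation(stimuli, responses, standard=1.4, minutes=8.0):
    """Estimate A from a series of test stimuli S and measured responses R.

    The stimulus giving the standard response is found by linear
    interpolation in log S (an assumption) and then divided by 8 minutes.
    """
    pairs = sorted(zip(stimuli, responses))
    for (s0, r0), (s1, r1) in zip(pairs, pairs[1:]):
        if r0 != r1 and min(r0, r1) <= standard <= max(r0, r1):
            frac = (standard - r0) / (r1 - r0)
            log_s = math.log(s0) + frac * (math.log(s1) - math.log(s0))
            return math.exp(log_s) / minutes
    raise ValueError("standard response not bracketed by the test stimuli")

# Hypothetical test series (stimuli in intensity units x minutes).
test_stimuli = [4.0, 8.0, 16.0, 32.0, 64.0]
test_responses = [1.10, 1.25, 1.42, 1.55, 1.70]
estimated_A = level_of_adaptation(test_stimuli, test_responses)
```

If the standard response is not bracketed by the measured responses the sketch raises an error instead of extrapolating, since the text only speaks of interpolation.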

Before proceeding with the presentation of our experimental data we should
briefly restate the immediate goal. Our observational data concern growth
velocities in response to certain illumination programs. We speak of the growth
output in response to an illumination input. The functional relation between
these two is enormously influenced by a variable describing the internal state
of the specimen, which we have called the level of adaptation A. We know
that the variable itself is determined by the illumination program, and it
therefore constitutes another and perhaps more immediate output of the
illumination input. To explore this relation we measure A as a function of time
after equilibration with various intensities or after a superimposed short
stimulus of various sizes.

The experimental results reveal the following facts about the connection
between I and A. First of all we find that in each case of illumination A drops
exponentially by a factor two in 2.5 minutes after the light is turned
off. Secondly we find in the case of a superimposed light flash of strength S
that the level of adaptation rises from the level I_1 to the level


(3)    $A = I_1 + S/b$,


where I_1 is the level of adaptation to which the plant had been adapted before
and b = 3.8 minutes (e-time) is the time constant of the adaptation system. From
these experimental findings we draw the conclusion that the functional
relation between I(t) and A(t) is described by the differential equation


(4)    $b\,\frac{dA}{dt} = I - A$.


This equation obviously satisfies the basic findings that I = A after
equilibration with constant intensity, and that A decreases in the dark
exponentially with the time constant b, irrespective of its initial level and
irrespective of how it was brought to this level. For an arbitrary illumination
program I(t) the equation has the integral


(5)    $A(t) = A(0)\,\exp[-t/b] + \frac{1}{b}\int_0^t I(s)\,\exp[-(t-s)/b]\,ds$.


The last equation implies that during a short stimulus A increases by
$(1/b)\int I(s)\,ds = S/b$, in accordance with the last-mentioned experiment.
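
The behaviour of the adaptation equation can be checked numerically. The following Python sketch integrates b·dA/dt = I − A with the forward Euler method; the step size, the function names and the flash parameters (I_2 = 1000 for 0.001 min) are illustrative assumptions, only b = 3.8 min comes from the text.

```python
import math

B = 3.8  # time constant b in minutes (value quoted in the text)

def simulate_adaptation(a0, intensity, t_end, dt=1e-4, b=B):
    """Integrate b * dA/dt = I(t) - A with the forward Euler method.

    `intensity` is a function of time in minutes; returns A(t_end).
    """
    a = a0
    for k in range(int(round(t_end / dt))):
        a += dt * (intensity(k * dt) - a) / b
    return a

def closed_form_dark(a0, t, b=B):
    """Equation (5) with I(t) = 0: exponential decay of A in the dark."""
    return a0 * math.exp(-t / b)

# Dark decay from A = 1 over one half-life, b * ln 2.
half_life = B * math.log(2.0)
a_dark = simulate_adaptation(1.0, lambda t: 0.0, half_life)

# Short flash (hypothetical numbers): adapted to I_1 = 1, then I_2 = 1000
# for 0.001 min, so that S = I_2 * FLASH = 1 (intensity units x minutes).
I1, I2, FLASH = 1.0, 1000.0, 0.001
a_after_flash = simulate_adaptation(I1, lambda t: I2, FLASH)
jump = a_after_flash - I1  # expected to be close to S / b
```

The jump equals S/b only approximately, because A already starts to follow the light during the flash; the approximation improves as the flash becomes shorter relative to b.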

In the next step we have to deal with the coupling between illumination 
input, adaptation and growth output. The experimental results have shown 
that stimuli and adaptive levels have to be raised proportionally to produce 
the same growth output. We infer that what is relevant for the growth
output is the ratio I/A. This functional parameter we call i(t).

Let us consider now what happens to i(t) during and after a short stimulus
S superimposed upon a constant background intensity I_1. During a short
square-shaped stimulus of intensity I_2 and duration Δt, A increases linearly
from A_- = I_1 to A_+ = A_- + S/b. Therefore during the stimulus we have the
relation


(6)    $A(t) = A_- + \frac{S\,t}{b\,\Delta t}$.


Since this increase of A during the stimulus may represent an increase
by a large factor, i(t), which jumps from unity to a high value at the
beginning of a stimulus, may decrease by a large factor even during the
shortest stimulus. This decrease has to be taken into account in evaluating
the spike transient in the functional parameter i(t), which we conceive to be
the quantity immediately responsible for the growth output. Let us define an
internal stimulus s of the system as the integral of the transient during the
stimulus. It works out as follows:


(7)    $s = \int_0^{\Delta t} i(t)\,dt = I_2 \int_0^{\Delta t} \frac{dt}{A_- + S\,t/(b\,\Delta t)} = b \log\left(1 + \frac{S}{b A_-}\right)$.


For strong stimuli, i.e. for stimuli which are in the range of those which
give the standard response or stronger, s is a logarithmic function of S, in
agreement with the above-mentioned experiments. For small stimuli (small
compared to those which give the standard response) the logarithm may be
developed into a power series in S/bA_- and s becomes equal to the first term
of the series. We then have s = S/A_-, i.e. s is proportional to S. An
experimental test of this last prediction seems impossible, since the
fluctuation of the growth velocity does not permit testing this range of the
reaction.
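
The two limits of the internal stimulus can be illustrated with a short Python sketch. It evaluates the closed form s = b·log(1 + S/(bA_-)) and compares it with a direct numerical integration of i(t) = I_2/A(t) over the flash, with A(t) rising linearly during the stimulus. The function names and the numerical values of S, I_2 and Δt are assumptions chosen for illustration.

```python
import math

B = 3.8  # time constant b in minutes (value quoted in the text)

def internal_stimulus(S, a_minus, b=B):
    """Closed form: s = b * log(1 + S / (b * A_-))."""
    return b * math.log1p(S / (b * a_minus))

def internal_stimulus_numeric(i2, flash, a_minus, b=B, n=20000):
    """Midpoint-rule integral of i(t) = I_2 / A(t) over the flash,
    with A(t) = A_- + S * t / (b * flash) rising linearly."""
    S = i2 * flash
    h = flash / n
    total = 0.0
    for k in range(n):
        t = (k + 0.5) * h
        total += i2 / (a_minus + S * t / (b * flash)) * h
    return total

# Strong stimulus: S = 80 (I_2 = 320 for 0.25 min) on A_- = 1.
s_strong = internal_stimulus(80.0, 1.0)
s_strong_numeric = internal_stimulus_numeric(320.0, 0.25, 1.0)

# Weak stimulus: s is nearly equal to S / A_-.
s_weak = internal_stimulus(1e-4, 1.0)
```

For the strong stimulus s is much smaller than S/A_-, which is the logarithmic compression described above; for the weak stimulus the two agree to first order.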

The functional parameter i(t) has another important property. For an
arbitrary illumination program which is preceded and followed by equilibration
with the same intensity I_1 we have


(8)    $\int (i - 1)\,dt = \int \frac{I - A}{A}\,dt = b \int \frac{1}{A}\frac{dA}{dt}\,dt = 0$.


The relation means that the positive and negative deviations of i resulting
from any program which returns to the original intensity cancel exactly. This
prediction was checked very carefully with two light programs, a pulse-up
and a pulse-down program superimposed on a constant adapting intensity I_1.
The results show very clearly that there is no net gain or loss in growth in
either of these experiments. If the program goes from equilibration with I_1
through any intermediate course to equilibration with another intensity I_2,
then we have the relation


(9)    $\int (i - 1)\,dt = b \log (I_2/I_1)$.


The last relation was checked by step-up and step-down light programs
and found to agree with the above-mentioned statement.
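
Both integral relations can be verified numerically. The sketch below integrates the adaptation equation b·dA/dt = I − A with forward Euler steps and accumulates the integral of i(t) − 1 = I/A − 1. The program shapes, the intensities 1 and 4, the step size and the function name are illustrative assumptions.

```python
import math

B = 3.8  # time constant b in minutes (value quoted in the text)

def integrated_deviation(program, a0, t_end, dt=1e-3, b=B):
    """Integrate A via b * dA/dt = I - A (forward Euler) and accumulate
    the integral of i(t) - 1 = I/A - 1 over [0, t_end]."""
    a, total = a0, 0.0
    for k in range(int(round(t_end / dt))):
        light = program(k * dt)
        total += (light / a - 1.0) * dt
        a += dt * (light - a) / b
    return total

# Pulse-up program: adapted to I_1 = 1, raised to 4 during the first minute.
pulse = integrated_deviation(lambda t: 4.0 if t < 1.0 else 1.0, 1.0, 80.0)

# Step-up program: adapted to I_1 = 1, switched to I_2 = 4 at t = 0.
step = integrated_deviation(lambda t: 4.0, 1.0, 80.0)
expected_step = B * math.log(4.0)
```

The pulse integral comes out close to zero and the step integral close to b·log(I_2/I_1); the small residuals are discretization error of the Euler scheme.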

These relations suggest that the functional parameter i(t) may be the
important variable which quite generally stands in a linear functional
relationship to the growth output. By this we mean the following: at
equilibrium i(t) is always unity. Under the influence of a given illumination
program it will deviate from unity in a predictable manner. Let us assume we
could design an illumination program resulting in an i(t) program which
deviates from unity only during a short period. Such a pulse in i(t) will lead
to a growth output represented by a certain function of time. We now postulate
two things: first, that the growth output for pulses in i(t) of various sizes
is equal to the output produced by a unit pulse in i(t) multiplied by the
actual size of the pulse; and secondly, that the growth output for an arbitrary
illumination program can be calculated as a simple superposition of the outputs
of all the pulses into which the functional parameter i(t) can be decomposed.
These two postulates are formulated analytically in the next equation, in which
Dv_0(t) represents the growth output due to a unit pulse in i(t), Dv(t)
represents the actual output resulting from an arbitrary illumination program,
and Di(t) represents the i(t) output of the program. D expresses that we are
referring to deviations from the equilibrium values of the velocity and of
i(t), respectively:


(10)    $Dv(t) = \int_{-\infty}^{t} Dv_0(t - s)\, Di(s)\, ds$.
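
The superposition postulate expressed by equation (10) is a convolution of the deviation Di with a unit-pulse response. The Python sketch below discretizes it. The true response function is not given in the text, so the kernel used here is purely hypothetical: it is zero during a latent period of 2.5 minutes and then follows a biphasic wave u(1 − u/2)exp(−u) whose integral vanishes, mimicking the zero net growth described earlier.

```python
import math

def unit_response(t, latency=2.5):
    """Hypothetical unit-pulse response: zero during the latent period,
    then a biphasic wave u * (1 - u/2) * exp(-u) with zero net area."""
    u = t - latency
    if u <= 0.0:
        return 0.0
    return u * (1.0 - u / 2.0) * math.exp(-u)

def growth_deviation(di, dt):
    """Discrete superposition: Dv[k] = sum_j kernel[k - j] * Di[j] * dt."""
    kernel = [unit_response(k * dt) for k in range(len(di))]
    return [
        dt * sum(kernel[k - j] * di[j] for j in range(k + 1))
        for k in range(len(di))
    ]

DT = 0.05   # minutes per sample
N = 800     # 40 minutes in total
di_pulse = [5.0 if k < 4 else 0.0 for k in range(N)]  # short pulse in i(t)
dv = growth_deviation(di_pulse, DT)
net_growth = sum(dv) * DT  # integral of the velocity deviation
```

Doubling `di_pulse` doubles every sample of `dv`, which is the first postulate; the second postulate is the sum over `j` itself.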


We have attempted to test the linear functional relationship between i(t)
and v(t) in another manner which is less direct but more accurate, involving
medium-size periodic stimulations superimposed upon a constant intensity
and comparing the growth outputs for two such programs in which the periods
differ by a factor two. We cannot predict the v output for either one of these
programs until we know the basic response function v_0(t). However, it can be
shown very easily that the v outputs of the two programs should be related
to each other by the following equation


(11)    $Dv_T(t) = Dv_{2T}(t) + Dv_{2T}(t + T)$,


where T is the shorter of the two periods. This equation may be expressed
by saying that the v output of the T program is obtained from the v output
of the 2T program by superimposing the first and second half of the latter
over each other. The predicted curves show very good agreement with the
experimental curves. Strictly, the application of equation (11) presupposes
that the specimens come to adaptive equilibrium with intensity I_1 in the
intervals between stimulations. In our experiments this condition was met well
in the 2T program but somewhat imperfectly in the T program. As a result,
the stimuli in i(t) were probably a little smaller in the T program than in the
2T program. The v outputs should be proportional to these stimuli. The size
of this correction depends on the precise value of the time constant b of dark
adaptation; with the value from the dark adaptation experiments, the correction
amounts to 25%. This seems rather more than our experimental error.

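All these experimental findings can be summarized in the functional structure
given below.


[Block diagram: I(t) feeds the adaptation process with time constant b, which
yields A(t); I(t) and A(t) together give i(t), which drives a linear transducer
with output v(t).]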


The light I(t) controls the growth velocity v(t) in two ways.


1) I(t) determines the adaptive level A(t) with the time constant b of
3.8 minutes. The quantitative relationship between I and A is described by
equation (4).


2) I(t) determines the functional parameter i(t). The deviation Di(t)
controls Dv(t), the deviation of the growth velocity, through a linear
transducer. Since the analysed filter connecting I(t) with v(t) contains only
differential equations and one basic logical operation, it transforms the
« field of stimuli » into the « field of reactions ».


Up to this point we have studied reactions to stimulations of the entire
growing zone. Such an experimental procedure does not give any information
on the question whether the functional structure of the filter process is
built up by complicated interaction phenomena or simply by superposition
of light growth reactions carried out by the parts (molecular groups) of the
growing zone themselves. Let us take the last-mentioned possibility as a
hypothesis and ask: how could such a functional autonomy of
molecular groups in the growing zone be tested? The answer can be split into
two parts: first the test for azimuthal and second the test for longitudinal
autonomy of growth reactions (with respect to the geometry of the
sporangiophore).

The first test was carried out during the last months (REICHARDT and
VARJU [2]); the second one is still incomplete. The idea is as follows. When
we studied the growth reactions, the specimens were illuminated bilaterally
in such a way that the light distribution was symmetrical. The situation
is quite different when the growing zone receives light only from one side.
In fact the zone acts as a converging cylindrical lens. It concentrates the
entering light on an area of only 20% of the backward side of the growing zone.
We therefore deal with an asymmetrical light distribution. The result is the
well-known positive phototropic reaction. The sporangiophore bends with an
angular velocity which remains constant as long as the light beam falls
perpendicularly onto the growing segment (above the bend). The absolute value
of the angular velocity is independent of the light intensity within the large
range from 0.1 to 200 erg/cm² s (the « normal range ») and drops for
higher and lower intensities. The phototropic reaction disappears at
3·10⁻⁵ erg/cm² s at the lower end and at 2.2·10^[exponent illegible] erg/cm² s
at the upper end of the intensity range. With these findings in mind, we
designed the following experiment in order to study the azimuthal autonomy.
The specimens were bilaterally adapted for nearly 50 minutes to an intensity
I_1 near the lower end of the normal intensity range. At the time t = t_0 one
of the light channels was switched off and the other kept at the same intensity
until the upper end of the sporangiophore reached the « normal » stationary
angular velocity. Then the intensity I_1 of the light channel was suddenly
raised to an intensity I_2 (also within the normal intensity range) and kept
constant. What can be predicted about the phototropic reaction to such a light
program? If there exists azimuthal autonomy with respect to growth reactions,
the net gain in growth on the fully illuminated side of the growing zone should
be about five times larger than the net gain on the other side of the cell.
This is partly a consequence of equation (9). We have to expect an inversion
phase of the phototropic reaction as an answer to the one-channel step function
program. The experiments showed the predicted effect very clearly. After the
specimens had received the light step function, their positive phototropic
reaction turned into a negative one. After this phase they came back to the
normal stationary behaviour of positive phototropism, when the autonomous parts
of the growing zone had adapted to the new intensity I_2. Before and after the
inversion phase the angular velocity of the sporangiophore is the same, since
in both cases we stay in the normal phototropic range. The net effect of the
reaction consists in a time shift Δt of the phototropic reaction curve. Δt
turns out to be a logarithmic function of the intensity ratio I_2/I_1. If one
connects mathematically the time shift Δt with the integral stretch of the
growing zone and determines the integral stretch for the step function program,
the theory yields the relation


(12)    $\Delta t \sim \log (I_2/I_1)$,


which is in accordance with the experiments. These findings favour the
assumption of autonomous parts of the growing zone.

Finally we have to raise the question whether it is possible to explain the
stationary phototropic response by the functional mechanism of the light growth
reaction. We should like to propose the following model. BUDER (1918, 1920)
converted the converging lens property of the growing zone into a
diverging lens property by immersing the sporangiophore in a medium of higher
refractive index than that of the protoplasm of the sporangiophore. Under
these conditions the positive phototropism was converted into a negative one.
In addition, our experiments have shown that in the normal range of
phototropic response the angular velocity of the sporangiophore is independent
of the illumination intensity. From this we draw the conclusion that the
intensity distribution of light on the surface of the growing zone determines
the phototropic reaction. If one takes into account the circulation of the cell
wall material around the axis of the sporangiophore, one has to consider the
growth output effects of the autonomously reacting parts as a function of the
light intensity distribution along their way. Rough calculations have shown
that such a model would account for the main finding: the positive phototropic
reaction to illumination from one side. The experiments on this part of the
analysis are still incomplete.


Vision of movement in all animals and in man involves a physiological
interaction between adjacent visual units. In the beetle Chlorophanus viridis
this interaction was shown to be a process of cross-correlation. The eyes of
this insect are composed of facets (ommatidia) which act as visual units in the
process of perception of movement. The visual fields of adjacent ommatidia do
not overlap. A point-like visual stimulus is received by only one ommatidium
and not by its neighbors. The anatomical angle between the axes of two
adjacent facets is 6.8°. Many animals, including Chlorophanus, react to the
perception of movement in their visual field by optomotor reactions. They follow
the movement which they perceive by active turning reactions of their head
or their body and so reduce the movement stimulus which they receive through
their eyes. This may be described in terms of a feedback loop.

The direction and strength of the optomotor response have been used as
an indicator of the perception processes in the nervous parts of the eyes of
the experimental animal. In the experiments the feedback loop of the
reaction was cut by fixing the animal so that its optomotor reactions
could be observed by the experimenter but did not influence the position of
the animal itself in relation to its optical environment. The experimental
procedure (Y-maze-globe method) has been described in full detail
elsewhere (1951, 1958).

Successions of practically point-like light stimuli were delivered to the eye
of the experimental animal. If A, B, C, D, ... are adjacent ommatidia in a
horizontal row and if the sign « + » is given to a stimulus which consists of
an illumination change from darker to lighter, the formula
$^{+}A(t_1)\;^{+}B(t_2)$ may describe a succession of two stimuli in adjacent
ommatidia. The same succession may also be written $F^{++}_{AB}(t_1, t_2)$.
The reaction of the animal to $F^{++}_{AB}$ may be symbolized by $R^{++}_{AB}$.


1. - Results.


1) The simplest succession of light changes which is able to release an
optomotor response consists of two stimuli in adjacent ommatidia.


2) In producing optomotor responses each ommatidium can only
cooperate with its immediate neighbor or with the next but one. There is no
physiological interaction between ommatidia which are separated by more
than one unstimulated ommatidium.


3) The maximum reaction is obtained with a time interval between the
two stimuli of about 150 ms. The strength of reaction decreases with both
greater and smaller time intervals. The maximum time interval which was
shown experimentally to release a reaction was slightly over 10 s. One must
conclude that the first stimulus has an after-effect of 10 s or more which later
disappears. The real physiological interaction takes place between the
after-effect of one stimulus and the effect of a following one. The first
stimulus of a succession of two stimuli is modified by a filter which acts like
a low-pass filter.


4) Ry, = Ri 
5) Ry, =+ RY. 
6) Rit = Ri + Ri + RE. 
7) $R^{+-}_{AB} = R^{-+}_{AB} = -R^{++}_{AB} = -R^{--}_{AB}$,
i.e. $F^{+-}_{AB}$ releases a negative optomotor reaction.


8) As it has been known for a long time, a cylinder of gray stripes on 
white background releases weaker optomotor reactions than a cylinder of black 
stripes on white background which rotates with the same angular velocity. 
The strength of optomotor reactions of insects does not only depend on the 
velocity of the moving pattern but also on the amount of stimulus efficiency 
of the individual light changes of which the stimulus situation consists. In 
the following experiment I kept the time intervals of the stimulus succession 
constant and varied the stimulus intensities by using patterns of different gray 
shades with different contrasts. 


The result of this experiment was: the strength of reaction is a quadratic
function of the stimulus intensities.

The experimental results 7) and 8) may be described together as follows:
the direction and intensity of the optomotor response reflect the
product of the signs and intensities of the individual stimuli. There must
be a physiological mechanism by which the sensory input and the
motor output are linked through a process which works according to the formula
of multiplication.


The experimental facts 1)-8) may be represented by Fig. 1. The
ommatidia are represented by A and B. The low-pass filters are symbolized
by their time constant τ_H and the multiplication process by M.
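
The sign rule of the multiplication process can be illustrated with a minimal Python sketch of such a scheme: the after-effect of a stimulus in one channel is multiplied by the direct signal in the other channel, and the mirror-image product is subtracted. The exponential shape of the after-effect, its time constant of 3.5 s and all function names are assumptions made for illustration. This simplified model reproduces only the sign behaviour, not the decline of the reaction at very short intervals reported in result 3).

```python
import math

TAU = 3.5  # seconds; assumed time constant of the low-pass after-effect

def after_effect(sign, t_stim, t, tau=TAU):
    """Low-pass trace at time t left by a stimulus of the given sign at t_stim."""
    if t < t_stim:
        return 0.0
    return sign * math.exp(-(t - t_stim) / tau)

def detector_response(sign_a, t_a, sign_b, t_b):
    """After-effect in one channel multiplied by the direct signal in the
    other, with the mirror-image product subtracted (stimuli at t_a != t_b)."""
    a_then_b = after_effect(sign_a, t_a, t_b) * sign_b if t_b > t_a else 0.0
    b_then_a = after_effect(sign_b, t_b, t_a) * sign_a if t_a > t_b else 0.0
    return a_then_b - b_then_a

# Two stimuli 150 ms apart, first in A and then in B.
r_plus_plus = detector_response(+1, 0.0, +1, 0.15)
r_minus_minus = detector_response(-1, 0.0, -1, 0.15)
r_plus_minus = detector_response(+1, 0.0, -1, 0.15)
r_reversed = detector_response(+1, 0.15, +1, 0.0)  # B stimulated first
```

Equal signs give the same positive response, opposite signs give the negative of it, and reversing the order of stimulation also reverses the sign.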

9) W. REICHARDT calculated the output of the system of Fig. 1 for
the following conditions:


a) Inputs A and B are stimulated by a moving
pattern which is composed at random of shades from
white through black, in such a way that both A and B
receive « white noise » stimulation.


b) The output of the channel of B is subtracted
from the output of the channel of A (since the two
represent movements in opposite directions).


c) The transmission is supposed to be linear in
both the two low-pass filters and the two other filters
which represent the inertia of the conduction lines which
cross between the two straight channels; the time constants of the filters
are symbolized by τ_H and τ_D.


Fig. 1.


REICHARDT found the output to be given by


$f(v) = \mathrm{const} \cdot \left( \exp\left[-\frac{x}{v\,\tau_H}\right] - \exp\left[-\frac{x}{v\,\tau_D}\right] \right)$.


Here v is the velocity of the stripe pattern relative to the sensory inputs and
x is the spatial distance between the inputs.

This theoretical result has been tested in the beetle Chlorophanus by
measuring the strength of optomotor response to the movement of a pattern
which was composed at random of stripes of 10 shades of gray, each of
width 0.57°. The result matched the theoretical curve very well if the time
constants τ_H and τ_D were given the values 3500 ms and 46 ms.
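
The velocity dependence implied by such an output can be explored with a small Python sketch. It assumes that the output has the form of a difference of two exponentials, exp[−x/(v·τ_slow)] − exp[−x/(v·τ_fast)], with the larger time constant (3500 ms) in the first term and the smaller one (46 ms) in the second. The input separation x = 6.8° (adjacent ommatidia) and the function names are likewise assumptions made for illustration.

```python
import math

TAU_SLOW = 3.5    # seconds (3500 ms, value quoted in the text)
TAU_FAST = 0.046  # seconds (46 ms, value quoted in the text)
X = 6.8           # degrees; ASSUMPTION: the inputs are adjacent ommatidia

def correlator_output(v, x=X, tau_slow=TAU_SLOW, tau_fast=TAU_FAST):
    """f(v) up to a constant factor:
    exp(-x / (v * tau_slow)) - exp(-x / (v * tau_fast))."""
    if v <= 0.0:
        return 0.0
    return math.exp(-x / (v * tau_slow)) - math.exp(-x / (v * tau_fast))

def optimum_velocity(x=X, tau_slow=TAU_SLOW, tau_fast=TAU_FAST):
    """Velocity that maximizes f(v). With u = x / v, setting the derivative
    of exp(-u/tau_slow) - exp(-u/tau_fast) to zero gives
    u* = ln(tau_slow/tau_fast) * tau_slow * tau_fast / (tau_slow - tau_fast)."""
    u_star = (math.log(tau_slow / tau_fast) * tau_slow * tau_fast
              / (tau_slow - tau_fast))
    return x / u_star

v_opt = optimum_velocity()          # in degrees per second
peak = correlator_output(v_opt)
```

Under these assumptions the output vanishes for very slow and very fast patterns and has a single maximum in between, at a velocity fixed by x and the two time constants.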


2. - Conclusion.


The experimental facts suggest that in the eye of Chlorophanus the
evaluation of movement in the visual field is made by a correlation process
like that shown in Fig. 1.